Re: LEFT JOINS same data twice?

SGreen Fri, 15 Apr 2005 11:41:32 -0700

"Chris Knipe" <[EMAIL PROTECTED]> wrote on 04/15/2005 01:20:33 PM:


> > I eliminated your GROUP BY clause because you weren't actually 
GROUPing
> > anything. You weren't looking for a MIN(), a MAX(), an AVG(), or 
anything
> > else that GROUP BY was intended to be used for. That method of
> > de-duplicating results will return potentially misleading information 
as
> > it will randomly pick just one record from all of the records that fit
> > within that "group". This way, by using DISTINCT, you can see if 
things
> > change mid-flight or not (which is either an indication of bad data or
> > some other troubles).
> 
> Ok, again I should have explained this.  The reason why FlightData is so 

> huge and growing so rapidly, is because we take a snapshot every minute 
of 
> every single flight we monitor.  I therefore group by a time stamp 
and/or 
> FlightTracker (Depending what data I need), so that I can only get the 
data 
> returned each time the data (plane's position) actually changed.  A 
quick 
> example:

I remember you mentioning that in an earlier post. Sorry I didn't remember 
it for my reply. So you only want to join to the most recent FlightData 
record for any flight. Subqueries are one way to go but you can't use them 
if your MySQL is older than 4.1. (Personally, I don't use them if I can 
avoid them.)

This is the "groupwise maximum" problem that is a FAQ and is documented in 
the manual. (However, I think you already read those)

> 
> mysql> SELECT * FROM FlightData LIMIT 2\G
> *************************** 1. row ***************************
>       EntryID: 3471
>       Network: V
>     TimeStamp: 2005-04-14 07:27:42
>       Tracker: df0363ea-fe06-1027-b1dd-00e01884a424
> PilotCallSign: xxx
>      PilotCID: xxx
> PilotRealName: Abdul Henderson KDFW
>    CurHeading: 124
>        CurAlt: 34887
>         CurGS: 469
>         Plane: H/DC10/W
>   Transponder: 7207
>         QNHHg: 29.80
>         QNHMb: 1009.14
>       Enroute: 06:00:00
>          Feul: 10:00:00
>    AirportDep: KLAX
>    AirportDes: MMMX
>    AirportAlt:
>   PositionCur: 32.06073/-114.70269
>   PositionDep: 33.942536/-118.408075
>   PositionDes: 19.435278/-099.072222
>          PLGS: 480
>         PLAlt: 35000
>     PLDepTime: 640
>  PLFlightType: I
> *************************** 2. row ***************************
>       EntryID: 3472
>       Network: V
>     TimeStamp: 2005-04-14 07:37:42
>       Tracker: df0363ea-fe06-1027-b1dd-00e01884a424
> PilotCallSign: xxx
>      PilotCID: xxx
> PilotRealName: xxx
>    CurHeading: 255
>        CurAlt: 35707
>         CurGS: 470
>         Plane: B738
>   Transponder: 7207
>         QNHHg: 29.96
>         QNHMb: 1014.56
>       Enroute: 02:12:00
>          Feul: 05:30:00
>    AirportDep: KPDX
>    AirportDes: KLAX
>    AirportAlt:
>   PositionCur: 33.94045/-118.40567
>   PositionDep: 45.588722/-122.597500
>   PositionDes: 33.942536/-118.408075
>          PLGS: 350
>         PLAlt: 34000
>     PLDepTime: 220
>  PLFlightType: I
> 2 rows in set (0.00 sec)
> 
> FlightTracker therefore appears in the database a couple of hundred 
times 
> for each flight.  It is inserted each and every time we poll the flight, 

> whether the data has changed or not.  If I do not group therefore, my 
> original query would return the Deparutre / Arrival for each and every 
> occurance for FlightTracker.  I do however understand what you are 
saying, 
> and I was not aware that it would Group the GROUP / JOIN the entire 
table 
> before adding the exceptions.  This is obviously my problem as well.
> 
> Mainly, I use the TimeStamp to determine WHEN last the position of a 
flight 
> was changed (and thus I know when and what change was made in the 
flight), 
> whilst FlightTracker is used to "group" all the occurances of the 
changes in 
> the flight.
> 
>  > This command would fit between the CREATE...SELECT... and the second
> > SELECT... Even though the docs say that only 1 index is ever used per
> > query, I believe that each JOIN can also make use of an index if it 
exists
> > (this belief is based on the results of the EXPLAIN(s) of queries
> > containing JOINed tables. However, I could be completely wrong in this
> > belief.).
> >
> 
> I was seriously hoping to be able to do this effeciently with one query. 

> But what you are saying does make sense however - especially when the 
> database gets bigger and we're sitting with 1GB of flight data for 
example. 
> One question though, if we sit with (for example) with 1,000,000 records 

> instead of the odd 350K records in the database now.  Will it not take 
just 
> as long to populate the temp table?

I think you may need you to reconsider your current storage design. You 
may want two FlightData tables, one to hold the most recent entry 
(FlightDataCurrent) the other to store all of the records of a flight 
(FlightDataHistory). They can have the same table structure but they exist 
for two totally different purposes.

Run your "current status" queries against the "current" table while if you 
want in-depth history of a particular flight, use the "history" table. 
This will increase your responsiveness considerably as there should be 
only 1 record per active flight in that table.

There are several ways to approach this: 

a) INSERT everything into the history table then on a periodic basis, 
regenerate the Current table.
b) INSERT records into both tables and DELETE the old one from the history 
table.
c) INSERT records into both tables and flag the old ones in the current 
table as expired. Delete the "expired" records on a periodic basis.

Have you thought of using InnoDB. I think it would be a good fit for this 
as you would definitely be using transactions and you need to read and 
write frequently against the same table and I don't think the MyISAM model 
of locking the whole table will meet your concurrency needs.

> 
> > I do not think this technique is faster if you eliminate the WHERE 
clause
> > from the first query. But if you do have a WHERE, it should return 
much
> > faster than if you simply added a WHERE to your original query.
> 
> Ok.  That is something that I was not sure about either yes.  WHERE 
would 
> obviously make the JOINs / GROUP BY smaller in regards to the amount of 
data 
> examined and processed.  I thought perhaps LIMIT would do the same, in 
that 
> it would use only the 30 records specified in my example query. 
Obviously 
> it JOIN / GROUP the entire table first, and only then return the first 
30 
> records - which is indeed a big difference.
> 
> mysql> SELECT FlightData.PilotCallSign AS FLNum, FlightLogs.Tracker 
> AS Tracker,         FlightLogs.TimeStart AS TimeDep, Departure.IATA 
> AS AirpDep,         Arrival.IATA AS AirpDes,         FlightLogs.Plane AS 

> Plane    FROM FlightLogs    LEFT JOIN Airports AS Departure ON 
> FlightLogs.AirpDep=Departure.ICAO    LEFT JOIN Airports AS Arrival ON 
> FlightLogs.AirpDes=Arrival.ICAO    LEFT JOIN FlightData ON 
> FlightLogs.Tracker=FlightData.Tracker   GROUP BY FlightData.Tracker 
ORDER 
> BY FlightLogs.TimeStart   LIMIT 30;
> +---------+--------------------------------------
> +---------------------+---------+---------+----------+
> | FLNum   | Tracker                              | TimeDep             | 

> AirpDep | AirpDes | Plane    |
> +---------+--------------------------------------
> +---------------------+---------+---------+----------+
> <SNIP>
> +---------+--------------------------------------
> +---------------------+---------+---------+----------+
> 30 rows in set (14.43 sec)
> 
> mysql> SELECT FlightData.PilotCallSign AS FLNum, FlightLogs.Tracker 
> AS Tracker,         FlightLogs.TimeStart AS TimeDep, Departure.IATA 
> AS AirpDep,         Arrival.IATA AS AirpDes,         FlightLogs.Plane AS 

> Plane    FROM FlightLogs    LEFT JOIN Airports AS Departure ON 
> FlightLogs.AirpDep=Departure.ICAO    LEFT JOIN Airports AS Arrival ON 
> FlightLogs.AirpDes=Arrival.ICAO    LEFT JOIN FlightData ON 
> FlightLogs.Tracker=FlightData.Tracker WHERE FlightLogs.CCID='927968' 
GROUP 
> BY FlightData.Tracker   ORDER BY FlightLogs.TimeStart;
> +--------+--------------------------------------
> +---------------------+---------+---------+--------+
> | FLNum  | Tracker                              | TimeDep             | 
> AirpDep | AirpDes | Plane  |
> +--------+--------------------------------------
> +---------------------+---------+---------+--------+
> <SNIP>
> +--------+--------------------------------------
> +---------------------+---------+---------+--------+
> 2 rows in set (0.02 sec)
> 
> MAJOR difference indeed ;)  I'll check out the temp tables a bit later 
after 
> I had some sleep.  I think I'll need it for what I'm going to be 
exploring 
> here... hehehe  One last question... A sub select may not perhaps speed 
> things up as well?
> 
> --
> Chris.
> 

Getting any data out of a billion row table (1 million rows of FlightData 
x rows of Flights x Airports x Airports) is always going to take some 
time. If you can limit what your most frequently run queries need to query 
against, you will improve your performance considerably. OLTP and OLAP 
databases have almost mutually exclusive designs. The long-term history of 
a flight resembles an OLAP problem while the "what's going on right now" 
query is more of an OLTP problem. I honestly think that splitting your 
FlightData table will be a better design in the long run. Yes, it means 
more work when you INSERT a new record but I believe you will be saving 
more time on your queries than you will be losing through the extra 
effort. 

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Re: LEFT JOINS same data twice?

Reply via email to