[influxdb] newbie question: how to deal with data batches where data can be from a long time in the past

John Burk Thu, 10 Nov 2016 18:31:10 -0800

Apologies in advance, long time SQL, completely new to InfluxDB, working 
through the concepts.


Some background first:

I have a data collector which runs at clients' sites and gathers a metric 
sample every minute, stores in a local journal, and then (barring issues 
with either internet connectivity or my cloud-based data collection service 
being unavailable) transmits the batch of records every 15 minutes.  If the 
collector is unable to transmit the batch after gathering 15 records, it 
continues to accumulate records in the journal until such time as it can 
successfully post the entire batch.

The batch can be of any size, so far we've seen clients experience up 12 
hours of connectivity issues, and then successfully transmit the batch of 
700+ records.  I imagine someday we'll see batches that contain 24 hours' 
worth, or more.  The point is, the batch can be of any size, and the oldest 
record in the batch can be at any point in the past.

On my end, I store the records in a relational database with two different 
timestamps:

   - the insertion timestamp (when the service received the batch, the 
   "reporting" timestamp)
   - the sample timestamp (timestamp when the single metric record was 
   collected)

I'm storing these records in an RDBMS, but want to do an 
extract/transform/load into InfluxDB for charting data.

So when doing the RDBMS -> InfluxDB ETL, I can just pull from the RDBMS 
based on WHERE insertion_timestamp > my_last_ETL_query.  All well and good.

When everything's going well, all the new datapoints are from the last 15 
minutes.  But any ETL operation may end up getting recently-inserted data 
(recently inserted into the RDBMS) whose sample timestamps can be from 
quite a while back in the past.

I'm looking at how to downsample this data with CQ's, but I don't want to 
use a ridiculously long duration literal in a RESAMPLE FOR in order to 
cover my worst-case scenario; I expect that CQ's get more expensive the 
larger the time range gets relative to the GROUP BY interval.  One of the 
CQ's I want to write will just downsample GROUP BY time(2m), a factor of 
1/2, so a 12-hour (or longer) time range just seems wrong.

I had thought about not specifying the timestamp when doing the insertion 
into InfluxDB and using the sample timestamp as a tag so that the CQ could 
tell what was recently inserted, but as far as I can tell you can't perform 
functions on tags.  Is this correct?


-- 
Remember to include the version number!
--- 
You received this message because you are subscribed to the Google Groups 
"InfluxData" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/influxdb/dc6d5328-fffe-4ef1-a249-f86d73b122bb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[influxdb] newbie question: how to deal with data batches where data can be from a long time in the past

Reply via email to