Yes, John, your understanding is good. The only way to get a CQ to run on data with old timestamps is to have the RESAMPLE FOR clause be larger than the maximum delta between now() and the timestamp. With your use case, a RESAMPLE FOR 24h on a GROUP BY time(2m) query is indeed a tremendous amount of wasted effort.
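For concreteness, here is roughly what such a CQ would look like (the database, measurement, and field names here are made up for illustration). With no RESAMPLE EVERY clause, the execution interval defaults to the GROUP BY time() interval, so this would run every 2 minutes and recompute all 720 two-minute buckets in the 24-hour window on every run:

```sql
CREATE CONTINUOUS QUERY "cq_downsample_2m" ON "mydb"
RESAMPLE FOR 24h
BEGIN
  SELECT mean("value") INTO "cpu_2m" FROM "cpu" GROUP BY time(2m), *
END
```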
You are also correct that tags cannot be passed to functions, and tag values must be strings. Also, storing a timestamp as a tag guarantees you will have series cardinality <https://docs.influxdata.com/influxdb/v1.0/troubleshooting/frequently-asked-questions/#why-does-series-cardinality-matter> issues. Each unique tag value creates a new series in the system. Tags are intended for metadata that doesn't change frequently; anything with an effectively unbounded number of possible values should be stored as a field.

Using the CLI, you can script a cron job equivalent to run a large batch query and do the downsampling you need. The frequency with which it runs should match the maximum amount of staleness you can tolerate. Given that the data can arrive as much as 24 hours late, I assume something like once an hour would be appropriate, and that shouldn't be too hard on the system. I would schedule the queries on a non-zero time boundary, as many system tasks automatically kick off at xx:00. Set the query to run at something like 17 minutes after the hour, to avoid overlap with internal housekeeping tasks.

You could store the insertion_timestamp as a number in a field and then reference it in the WHERE clause to exclude records with a more recent timestamp. That said, I suspect it will be faster to simply resample everything, since fields in the WHERE clause require every point to be scanned anyway.

On Fri, Nov 11, 2016 at 12:31 AM, John Burk <[email protected]> wrote:

> Think I may have found a solution:
>
>> If you need to constantly insert data with timestamps in the past and you
>> want to automatically downsample it, you can script the CLI standalone
>> query function
>> <https://www.google.com/url?q=https%3A%2F%2Fdocs.influxdata.com%2Finfluxdb%2Fv0.13%2Ftools%2Fshell%2F%23execute-an-influxql-command-and-quit-with-execute&sa=D&sntz=1&usg=AFQjCNFtMsYtWJjh-W4DuXLfexofg_rYEw>
>> with an INTO query covering the appropriate historical range.
>
> I was initially concerned that I couldn't use the "collection timestamp"
> (which can be any amount of time in the past), since it would fall outside
> of the CQ's time range unless I covered my worst-case scenario with an
> inappropriately large time range in the RESAMPLE FOR interval. And I
> couldn't use the "collection timestamp" as a tag, since I learned through
> more reading that tags can only be strings, right?
>
> It looks like the approach I should use is to manage it in my ETL
> operation, and if it looks like I'm dealing with a timestamp outside of the
> "normal" range of my CQ, I fire off an INTO query that can cover the
> appropriate historical range.
>
> Am I reading all this correctly?
>
> On Thursday, November 10, 2016 at 6:30:50 PM UTC-8, John Burk wrote:
>>
>> Apologies in advance: long-time SQL, completely new to InfluxDB, working
>> through the concepts.
>>
>> Some background first:
>>
>> I have a data collector which runs at clients' sites and gathers a metric
>> sample every minute, stores it in a local journal, and then (barring issues
>> with either internet connectivity or my cloud-based data collection service
>> being unavailable) transmits the batch of records every 15 minutes. If the
>> collector is unable to transmit the batch after gathering 15 records, it
>> continues to accumulate records in the journal until it can
>> successfully post the entire batch.
>>
>> The batch can be of any size. So far we've seen clients experience up to
>> 12 hours of connectivity issues and then successfully transmit a batch of
>> 700+ records. I imagine someday we'll see batches that contain 24 hours'
>> worth, or more. The point is, the batch can be of any size, and the oldest
>> record in the batch can be at any point in the past.
>>
>> On my end, I store the records in a relational database with two
>> different timestamps:
>>
>> - the insertion timestamp (when the service received the batch; the
>>   "reporting" timestamp)
>> - the sample timestamp (when the single metric record was collected)
>>
>> I'm storing these records in an RDBMS, but want to do an
>> extract/transform/load into InfluxDB for charting data.
>>
>> So when doing the RDBMS -> InfluxDB ETL, I can just pull from the RDBMS
>> based on WHERE insertion_timestamp > my_last_ETL_query. All well and
>> good.
>>
>> When everything's going well, all the new datapoints are from the last 15
>> minutes. But any ETL operation may end up getting recently inserted data
>> (recently inserted into the RDBMS) whose sample timestamps can be from
>> quite a while back in the past.
>>
>> I'm looking at how to downsample this data with CQs, but I don't want to
>> use a ridiculously long duration literal in a RESAMPLE FOR in order to
>> cover my worst-case scenario; I expect that CQs get more expensive the
>> larger the time range gets relative to the GROUP BY interval. One of the
>> CQs I want to write will just downsample to GROUP BY time(2m), a factor of
>> 1/2, so a 12-hour (or longer) time range just seems wrong.
>>
>> I had thought about not specifying the timestamp when doing the insertion
>> into InfluxDB and using the sample timestamp as a tag so that the CQ could
>> tell what was recently inserted, but as far as I can tell you can't perform
>> functions on tags. Is this correct?
>
> --
> Remember to include the version number!
> ---
> You received this message because you are subscribed to the Google Groups
> "InfluxData" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/influxdb.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/influxdb/a03d511a-5166-475f-8403-4489cc0c8f02%40googlegroups.com
> <https://groups.google.com/d/msgid/influxdb/a03d511a-5166-475f-8403-4489cc0c8f02%40googlegroups.com?utm_medium=email&utm_source=footer>.
>
> For more options, visit https://groups.google.com/d/optout.

--
Sean Beckett
Director of Support and Professional Services
InfluxDB
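The cron-scripted CLI approach described in Sean's reply above could be sketched roughly as follows. This is a hypothetical example: the database ("mydb"), measurements ("cpu", "cpu_2m"), and field ("value") are made up, and the actual `influx` invocation is left commented out so the script only shows the query it would run against a real server.

```shell
#!/bin/sh
# Hourly downsampling job. Schedule it at 17 minutes past the hour to
# stay off the xx:00 boundary, e.g. with a crontab entry like:
#   17 * * * * /usr/local/bin/downsample.sh
#
# All names below (database, measurements, field) are hypothetical.

DB="mydb"

# Re-aggregate the last 25 hours so that samples arriving up to 24 hours
# late are still picked up by at least one hourly run.
QUERY="SELECT mean(\"value\") AS \"value\" INTO \"cpu_2m\" FROM \"cpu\" WHERE time > now() - 25h GROUP BY time(2m), *"

echo "Would run against $DB: $QUERY"
# Execute the standalone query via the CLI and exit:
# influx -database "$DB" -execute "$QUERY"
```

Unlike a CQ with a huge RESAMPLE FOR, this recomputes the 25-hour window once per hour rather than every 2 minutes, which is the cost trade-off discussed in the thread.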
