lazarillo commented on issue #32781:
URL: https://github.com/apache/beam/issues/32781#issuecomment-2414124334

   I actually have the following notes in my internal repo, which might give 
some idea as to why this sort of feature is useful.
   
   As I'm sure anyone reading this thread is aware, the way that Google handles 
timestamps across all of its products is all over the place.
   
   I keep these notes in our repo so that I (or anyone else) do not forget 
all of these idiosyncrasies:
   
   ---
   
   ## Working with timestamps in Dataflow (and BigQuery and Pub/Sub and 
protobufs)
   
   Timestamp defaults are *all over the place* within the Googleverse.
   
   - When working with protobufs, the most common approach is to use a protobuf 
`Timestamp` message, which has the fields `seconds` and `nanos`. (Note: it is 
`nanos`, not `nanoseconds`.)
   - When working with the Pub/Sub-to-BigQuery direct subscription connector, 
you _cannot use_ a protobuf `Timestamp` message, and you must instead provide 
an integer which is the _number of microseconds_ since epoch (1970-01-01). Yes, 
**microseconds**.
   - When working within Dataflow itself (in Python), there is no direct way 
to `WriteToBigQuery` with a protobuf message, so you must first convert to 
JSON or a `dict` or something similar. In this case (whether you pass JSON or 
a `dict`, which is pushed to a JSON representation behind the scenes as best 
I can tell), the timestamp must be either an integer giving the _number of 
seconds_ since epoch **or a float giving the _number of milliseconds_ since 
epoch**. So the expected resolution depends upon the primitive type! Best to 
avoid a numeric timestamp in this case entirely.
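   To make the unit zoo above concrete, here is a stdlib-only sketch (the event 
time is a hypothetical value) converting one protobuf-style `(seconds, nanos)` 
pair into each of the scales mentioned:
   
   ```python
   # Hypothetical event time, expressed as the two protobuf Timestamp fields.
   seconds, nanos = 1_700_000_000, 123_456_789
   
   # Protobuf Timestamp resolution: nanoseconds.
   nanos_since_epoch = seconds * 1_000_000_000 + nanos
   
   # Pub/Sub -> BigQuery direct subscription: integer MICROseconds since epoch.
   micros_since_epoch = seconds * 1_000_000 + nanos // 1_000
   
   # The float case above: milliseconds since epoch.
   millis_since_epoch = seconds * 1_000 + nanos // 1_000_000
   ```
   
   Four representations of the same instant, differing only by powers of 1000, 
which is exactly why mixing them up is so easy.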
   
   **Yes, you read that correctly: in the Googleverse a timestamp must be 
represented as nanoseconds, microseconds, milliseconds, or seconds, depending 
upon the resource you're talking to and the primitive type of the timestamp 
itself (and possibly the language you're using).**
   
   Best practices to deal with all of this:
   
   - If the item will ever be written directly into BigQuery from Pub/Sub via 
the direct subscription, **you have no choice, it must be an integer 
representing the number of _microseconds_ since epoch**.
   - If the item is not written directly to BigQuery (i.e., it is processed in 
Dataflow), represent it as a proper protobuf `Timestamp`, because when it is 
converted to JSON or a `dict`, the protobuf conversion code handles the units 
by emitting a string representation at the proper scale.
       - This creates a larger message over the wire, but it prevents 
accidental errors that depend on the source data type being `int` or `float`.
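   A minimal stdlib-only sketch of these two rules (the values are hypothetical, 
and in real code the RFC 3339 string would come from protobuf's own JSON 
conversion rather than hand-formatting):
   
   ```python
   from datetime import datetime, timezone
   
   # Hypothetical event time, as the two protobuf Timestamp fields.
   seconds, nanos = 1_700_000_000, 123_456_789
   
   # Rule 1 -- Pub/Sub -> BigQuery direct subscription: no choice, it must be
   # an integer number of microseconds since epoch.
   event_ts_micros = seconds * 1_000_000 + nanos // 1_000
   
   # Rule 2 -- Dataflow path: render as an RFC 3339 string (what protobuf's
   # JSON conversion emits for Timestamp fields), which sidesteps the
   # int-vs-float resolution ambiguity entirely.
   dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
   event_ts_rfc3339 = dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{nanos:09d}Z"
   ```
   
   The string is longer on the wire than a bare integer, but its meaning does 
not change with the primitive type that carries it.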
   
   ---
   
   So the addition of this feature would _drastically_ simplify our 
type-checking workflow.
   

