[ https://issues.apache.org/jira/browse/BEAM-2879?focusedWorklogId=318608&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318608 ]
ASF GitHub Bot logged work on BEAM-2879:
----------------------------------------
Author: ASF GitHub Bot
Created on: 25/Sep/19 21:18
Start Date: 25/Sep/19 21:18
Worklog Time Spent: 10m
Work Description: steveniemitz commented on pull request #9665:
[BEAM-2879] Support writing data to BigQuery via avro
URL: https://github.com/apache/beam/pull/9665
This change enhances BigQueryIO.Write to support writing Avro files rather
than JSON when using FILE_LOADS (STREAMING_INSERTS is unchanged).
This is semi-WIP, but I wanted to put the review up early to get feedback.
TODO:
- more documentation in BigQueryIO
- unit tests
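To illustrate why switching the intermediate files from JSON to Avro can shrink them, here is a minimal sketch that encodes one record both ways. It is not the code from this pull request: the record layout (`user_id`, `country`, `clicks`) and the class name are hypothetical, and the binary path hand-rolls only the parts of Avro's binary encoding needed here (zig-zag variable-length longs, length-prefixed UTF-8 strings, fields written in schema order with no field names). A real Avro container file also carries a header with the schema and per-block sync markers, which this sketch ignores.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class EncodingSizeSketch {

    // Avro-style long: zig-zag so small magnitudes become small unsigned
    // values, then 7 bits per byte with the high bit as a continuation flag.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Avro-style string: byte length (encoded as a long) followed by UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Hypothetical record (user_id: long, country: string, clicks: long).
    // Field names live in the schema, so only the values are written.
    static byte[] encodeBinary(long userId, String country, long clicks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, userId);
        writeString(out, country);
        writeLong(out, clicks);
        return out.toByteArray();
    }

    // The same record as a JSON line: field names and punctuation repeat per record.
    // No string escaping is done; this is only good enough for a size comparison.
    static byte[] encodeJson(long userId, String country, long clicks) {
        String json = "{\"user_id\":" + userId
                + ",\"country\":\"" + country + "\""
                + ",\"clicks\":" + clicks + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] binary = encodeBinary(123456789L, "US", 42L);
        byte[] json = encodeJson(123456789L, "US", 42L);
        System.out.println("binary bytes: " + binary.length);
        System.out.println("json bytes:   " + json.length);
    }
}
```

The gap depends heavily on the data: records dominated by long string values would see a much smaller difference than records made of short numeric fields.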
### Benchmarks
Preliminary results look good. The more CPU-constrained a job is, the
larger the speedup from Avro.
My test dataset is a typical workload of ours: around 2 billion records
(~130 GB serialized) representing the result of a combine. My tests read these
records from GCS and wrote them to BigQuery. The jobs ran on Dataflow
with 150 n1-standard-2 workers.
format | time to start load job | bytes written | BQ slot time (ms)
-------|----------------------|--------------|-------------
avro | 6 m 30 s | 126 GB | 35,189,679
json | 8 m 5 s | 712 GB | 96,006,088
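
For a quick read of the table, the sketch below computes the JSON-to-Avro ratio for each column. The class and method names are illustrative only; the input figures are copied from the table above.

```java
public class BenchmarkRatios {

    // How many times larger the baseline figure is than the candidate figure.
    static double ratio(double baseline, double candidate) {
        return baseline / candidate;
    }

    // Converts a "M m S s" duration from the table into seconds.
    static long toSeconds(long minutes, long seconds) {
        return minutes * 60 + seconds;
    }

    public static void main(String[] args) {
        double time = ratio(toSeconds(8, 5), toSeconds(6, 30));
        double bytes = ratio(712, 126);                      // GB written
        double slot = ratio(96_006_088L, 35_189_679L);       // BQ slot time (ms)

        System.out.printf("time to start load job: json / avro = %.2f%n", time);
        System.out.printf("bytes written:          json / avro = %.2f%n", bytes);
        System.out.printf("BQ slot time:           json / avro = %.2f%n", slot);
    }
}
```

On these numbers, JSON writes roughly 5.7x the bytes and uses roughly 2.7x the BigQuery slot time, while the load job starts about 20% sooner with Avro.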
------------------------
- [ ] [**Choose
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue, if applicable. This will automatically link the pull request to the
issue.
- [x] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 318608)
Remaining Estimate: 0h
Time Spent: 10m
> Implement and use an Avro coder rather than the JSON one for intermediary
> files to be loaded in BigQuery
> --------------------------------------------------------------------------------------------------------
>
> Key: BEAM-2879
> URL: https://issues.apache.org/jira/browse/BEAM-2879
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Black Phoenix
> Assignee: Steve Niemitz
> Priority: Minor
> Labels: starter
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Before data is loaded into BigQuery, temporary files are created and encoded in
> JSON, which is costly compared to an Avro alternative.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)