A quick clarification on how we can get rid of numFileShards (adding dev@).
numFileShards exists for the scenario of using BigQuery file loads in a streaming setting. Writing to one file per window doesn't work at high scale, and there is no good number to choose here that works for every workload; hence the parameter.

A better option would be to optimize for the size of each file (e.g. aim for each generated file to be 10 MB, and write as many files as needed to maintain that). This can be done inside of BigQueryIO by forking the data into a combiner that estimates the total data size (we can even sample if we're worried about the performance hit here) and then dividing this by the desired file size; the resulting number will turn into a side input to the write step, indicating how many files to write. The file size won't be perfectly matched because triggers are nondeterministic, but it should be good enough for file loads.

A desiredFileSize parameter is easier for users to reason about than a numFileShards parameter, IMO. Even more importantly, we can pick a reasonable default for this parameter (e.g. 10 or 20 MB) that will probably work reasonably well for all workloads. (A rough sketch of this idea is at the bottom of this mail, below the quoted thread.)

Reuven

On Thu, Mar 8, 2018 at 10:03 PM Eugene Kirpichov <kirpic...@google.com> wrote:

> It's unfortunate that we have this parameter at all - we discussed various ways to get rid of it with +Reuven Lax <re...@google.com>; ideally we'd be computing it automatically. In your case the throughput is quite modest, and even a value of 1 should do well.
>
> Basically, in this codepath we write the data to files in parallel, and every $triggeringFrequency we flush the files to a BigQuery load job. How many files to write in parallel depends on the throughput. The fewer, the better, but the write throughput to a single file is limited. You can assume that write throughput to GCS is a few dozen MB/s per file; I assume 1000 events/s fits under that, depending on the event size.
>
> Actually, with that in mind, we should probably just set the value to something like 10 or 100, which will be enough for most needs (up to about 5 GB/s), but keep it configurable for people who need more, and eventually figure out a way to autoscale it.
>
> On Thu, Mar 8, 2018 at 1:50 AM Jose Ignacio Honrado <jihonra...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using BigQueryIO from Apache Beam 2.3.0 and Scio 0.47 to load data into BQ from Dataflow using load jobs (Write.Method.FILE_LOADS). Here is the code:
>>
>> val timePartitioning = new TimePartitioning()
>>   .setField("partition_day")
>>   .setType("DAY")
>>
>> BigQueryIO.write[Event]
>>   .to("some-table")
>>   .withCreateDisposition(Write.CreateDisposition.CREATE_IF_NEEDED)
>>   .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND)
>>   .withMethod(Write.Method.FILE_LOADS)
>>   .withFormatFunction((input: Event) => BigQueryType[Event].toTableRow(input))
>>   .withSchema(BigQueryType[Event].schema)
>>   .withTriggeringFrequency(Duration.standardMinutes(15))
>>   .withNumFileShards(XXX)
>>   .withTimePartitioning(timePartitioning)
>>
>> My question is related to "numFileShards", which is a mandatory parameter to set when using a "triggeringFrequency". I have been trying to find information and reading the source code to understand what it does, but I couldn't find anything relevant.
>>
>> Considering there is gonna be a throughput of 300-1000 events per second, what would be the recommended value?
>>
>> Thanks!
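
P.S. To put rough numbers on Eugene's point for this workload: assuming (purely for illustration) events of around 1 KB each, 1000 events/s is about 1 MB/s, far below the few dozen MB/s that a single GCS file write can sustain, so even numFileShards = 1 should keep up here, as Eugene says.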
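
P.P.S. For concreteness, here is a rough, untested sketch of the combiner-plus-side-input idea described at the top of this mail, written against the Beam Java SDK at the user level rather than inside BigQueryIO itself. The ShardBySize name, the desiredFileSizeBytes parameter, and the toString()-based size estimate are all made up for illustration:

  import java.util.concurrent.ThreadLocalRandom;

  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.Sum;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.PCollectionView;
  import org.apache.beam.sdk.values.TypeDescriptors;

  import com.google.api.services.bigquery.model.TableRow;

  /** Hypothetical: key each row with a shard id so shards come out near a target byte size. */
  class ShardBySize extends PTransform<PCollection<TableRow>, PCollection<KV<Integer, TableRow>>> {
    private final long desiredFileSizeBytes;  // e.g. 10L << 20 for ~10 MB files

    ShardBySize(long desiredFileSizeBytes) {
      this.desiredFileSizeBytes = desiredFileSizeBytes;
    }

    @Override
    public PCollection<KV<Integer, TableRow>> expand(PCollection<TableRow> rows) {
      // Fork the data into a combiner that estimates the total byte size of each
      // window/pane. toString() length is a crude stand-in for the encoded size;
      // a real version could sample to avoid the performance hit.
      final PCollectionView<Long> totalBytesView =
          rows.apply("EstimateBytes",
                  MapElements.into(TypeDescriptors.longs())
                      .via((TableRow row) -> (long) row.toString().length()))
              .apply("SumBytes", Sum.longsGlobally().asSingletonView());

      final long targetBytes = desiredFileSizeBytes;

      // Divide the estimate by the desired file size to get a shard count, then
      // scatter elements across that many keys; each key becomes one file.
      return rows.apply("AssignShard",
          ParDo.of(new DoFn<TableRow, KV<Integer, TableRow>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              long totalBytes = c.sideInput(totalBytesView);
              int numShards = (int) Math.max(1, totalBytes / targetBytes);
              c.output(KV.of(ThreadLocalRandom.current().nextInt(numShards), c.element()));
            }
          }).withSideInputs(totalBytesView));
    }
  }

The keyed output would then feed the existing file-writing step, one file per key; as noted above, trigger nondeterminism means the file sizes would only approximately hit the target.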