A quick clarification on how we can get rid of numFileShards (adding dev@).
numFileShards exists for the scenario of using BigQuery file loads in a streaming setting. Writing to one file per window doesn't work at high scale, and there is no good number to choose here that works for every workload; hence the parameter.

A better option would be to optimize for the size of each file (e.g. aim for each generated file to be 10 MB, and write as many files as needed to maintain that). This can be done inside of BigQueryIO by forking the data into a combiner that estimates the total data size (we can even sample if we're worried about the performance hit here) and then dividing this by the desired file size; the resulting number will turn into a side input to the write step, indicating how many files to write. The file size won't be perfectly matched because triggers are nondeterministic, but it should be good enough for file loads.

A desiredFileSize parameter is easier for users to reason about than a numFileShards parameter, IMO. Even more importantly, we can pick a reasonable default for this parameter (e.g. 10 or 20 MB) that will probably work reasonably well for all workloads. (A rough sketch of this idea is at the bottom of this mail, below the quoted thread.)

Reuven

On Thu, Mar 8, 2018 at 10:03 PM Eugene Kirpichov <kirpic...@google.com> wrote:

> It's unfortunate that we have this parameter at all - we discussed various ways to get rid of it with +Reuven Lax <re...@google.com>; ideally we'd be computing it automatically. In your case the throughput is quite modest, and even a value of 1 should do well.
>
> Basically, in this codepath we write the data to files in parallel, and every $triggeringFrequency we flush the files to a BigQuery load job. How many files to write in parallel depends on the throughput. The fewer, the better, but the write throughput to a single file is limited. You can assume that write throughput to GCS is a few dozen MB/s per file; I assume 1000 events/s fits under that, depending on the event size.
>
> Actually, with that in mind, we should probably just set the value to something like 10 or 100, which will be enough for most needs (up to about 5 GB/s), but keep it configurable for people who need more, and eventually figure out a way to autoscale it.
>
> On Thu, Mar 8, 2018 at 1:50 AM Jose Ignacio Honrado <jihonra...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using BigQueryIO from Apache Beam 2.3.0 and Scio 0.47 to load data into BQ from Dataflow using load jobs (Write.Method.FILE_LOADS). Here is the code:
>>
>> val timePartitioning = new TimePartitioning()
>>   .setField("partition_day")
>>   .setType("DAY")
>>
>> BigQueryIO.write[Event]
>>   .to("some-table")
>>   .withCreateDisposition(Write.CreateDisposition.CREATE_IF_NEEDED)
>>   .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND)
>>   .withMethod(Write.Method.FILE_LOADS)
>>   .withFormatFunction((input: Event) => BigQueryType[Event].toTableRow(input))
>>   .withSchema(BigQueryType[Event].schema)
>>   .withTriggeringFrequency(Duration.standardMinutes(15))
>>   .withNumFileShards(XXX)
>>   .withTimePartitioning(timePartitioning)
>>
>> My question is related to "numFileShards", which is a mandatory parameter to set when using a "triggeringFrequency". I have been trying to find information and reading the source code to understand what it does, but I couldn't find anything relevant.
>>
>> Considering there is gonna be a throughput of 300-1000 events per second, what would be the recommended value?
>>
>> Thanks!
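
P.S. To put rough numbers on Eugene's point for this workload: assuming (purely for illustration) events of around 1 KB each, 1000 events/s is about 1 MB/s, far below the few dozen MB/s that a single GCS file write can sustain, so even numFileShards = 1 should keep up here, as Eugene says.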
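
P.P.S. For concreteness, here is a rough, untested sketch of the combiner-plus-side-input idea described at the top of this mail, written against the Beam Java SDK at the user level rather than inside BigQueryIO itself. The ShardBySize name, the desiredFileSizeBytes parameter, and the toString()-based size estimate are all made up for illustration:

  import java.util.concurrent.ThreadLocalRandom;

  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.Sum;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.PCollectionView;
  import org.apache.beam.sdk.values.TypeDescriptors;

  import com.google.api.services.bigquery.model.TableRow;

  /** Hypothetical: key each row with a shard id so shards come out near a target byte size. */
  class ShardBySize extends PTransform<PCollection<TableRow>, PCollection<KV<Integer, TableRow>>> {
    private final long desiredFileSizeBytes;  // e.g. 10L << 20 for ~10 MB files

    ShardBySize(long desiredFileSizeBytes) {
      this.desiredFileSizeBytes = desiredFileSizeBytes;
    }

    @Override
    public PCollection<KV<Integer, TableRow>> expand(PCollection<TableRow> rows) {
      // Fork the data into a combiner that estimates the total byte size of each
      // window/pane. toString() length is a crude stand-in for the encoded size;
      // a real version could sample to avoid the performance hit.
      final PCollectionView<Long> totalBytesView =
          rows.apply("EstimateBytes",
                  MapElements.into(TypeDescriptors.longs())
                      .via((TableRow row) -> (long) row.toString().length()))
              .apply("SumBytes", Sum.longsGlobally().asSingletonView());

      final long targetBytes = desiredFileSizeBytes;

      // Divide the estimate by the desired file size to get a shard count, then
      // scatter elements across that many keys; each key becomes one file.
      return rows.apply("AssignShard",
          ParDo.of(new DoFn<TableRow, KV<Integer, TableRow>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              long totalBytes = c.sideInput(totalBytesView);
              int numShards = (int) Math.max(1, totalBytes / targetBytes);
              c.output(KV.of(ThreadLocalRandom.current().nextInt(numShards), c.element()));
            }
          }).withSideInputs(totalBytesView));
    }
  }

The keyed output would then feed the existing file-writing step, one file per key; as noted above, trigger nondeterminism means the file sizes would only approximately hit the target.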