Hi Christopher,

So, you have a PCollection<Document>, and you're writing it to files. FileIO.write/writeDynamic will write several Documents to each file. However, in your use case some of the individual Documents are so large that you would instead like each of those large documents to be split across several files.
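To make sure we mean the same thing by "split", here is a minimal sketch of how I picture it, based on your suggestion of splitting at a point in the list of definitions. It is plain Java, not Beam code and not an existing FileIO feature. The class name `ThriftDocumentSplitter`, the `split` method, and the `maxDefinitionsPerFile` parameter are all hypothetical, and I am assuming a document can be modelled as header lines (namespace and includes) plus a list of definition strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Beam): splits one Thrift document into several file bodies. */
public class ThriftDocumentSplitter {

  /**
   * Groups the definitions into chunks of at most maxDefinitionsPerFile and
   * repeats the header lines (namespace, includes) at the top of every chunk,
   * so the header always comes before the definitions.
   */
  public static List<String> split(
      List<String> headerLines, List<String> definitions, int maxDefinitionsPerFile) {
    if (maxDefinitionsPerFile < 1) {
      throw new IllegalArgumentException("maxDefinitionsPerFile must be at least 1");
    }
    List<String> files = new ArrayList<>();
    for (int start = 0; start < definitions.size(); start += maxDefinitionsPerFile) {
      int end = Math.min(start + maxDefinitionsPerFile, definitions.size());
      StringBuilder contents = new StringBuilder();
      for (String line : headerLines) {
        contents.append(line).append('\n');
      }
      for (String definition : definitions.subList(start, end)) {
        contents.append(definition).append('\n');
      }
      files.add(contents.toString());
    }
    return files;
  }

  public static void main(String[] args) {
    List<String> parts = split(
        Arrays.asList("namespace java example"),
        Arrays.asList("struct A {}", "struct B {}", "struct C {}"),
        2);
    // Print each part with a separator so the split points are visible.
    for (String part : parts) {
      System.out.println("---");
      System.out.print(part);
    }
  }
}
```

One thing this sketch glosses over: if a definition in one part refers to a definition that ended up in another part, the parts would presumably not compile on their own, so the choice of splitting points may need more thought.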
Before we continue, could you confirm whether my understanding is correct?

Thanks.

On Mon, Dec 2, 2019 at 7:08 PM Christopher Larsen <christopher.lar...@quantiphi.com> wrote:

> Ideally each element (document) will be written to a .thrift file so that
> it can be compiled without further manipulation.
>
> But in the case of an extremely large file I think it would be nice to
> split into smaller files. As far as splitting points go I think it could be
> split at a point in the list of definitions. Thoughts?
>
> On Mon, Dec 2, 2019 at 4:02 PM Reuven Lax <re...@google.com> wrote:
>
>> What do you mean by shard the output file? Can it be split at any byte
>> location, or only at specific points?
>>
>> On Mon, Dec 2, 2019 at 2:05 PM Christopher Larsen <christopher.lar...@quantiphi.com> wrote:
>>
>>> Hi Reuven,
>>>
>>> We would like to write each element to one file but still allow the
>>> runner to shard the output file which could yield more than one output file
>>> per element.
>>>
>>> On Mon, Dec 2, 2019 at 11:55 AM Reuven Lax <re...@google.com> wrote:
>>>
>>>> I'm not sure I completely understand the question. Are you saying that
>>>> you want each element to write to only one file, guaranteeing that two
>>>> elements are never written to the same file?
>>>>
>>>> On Mon, Dec 2, 2019 at 11:53 AM Christopher Larsen <christopher.lar...@quantiphi.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> TL/DR: can you extend FileIO.sink<T> to write one or more file per
>>>>> element instead of one or more elements per file?
>>>>>
>>>>> In working with Thrift files we have found that since a .thrift file
>>>>> needs to be compiled to generate code the order of the contents of the
>>>>> file are important (ie, the namespace and includes elements need to come
>>>>> before definitions are defined).
>>>>>
>>>>> The issue that we are facing is that by implementing
>>>>> FileIO.sink<Document> we cannot determine how many Document objects are
>>>>> written to a file since this is determined by the runner. This can result
>>>>> in more than one Document being written to a file which will cause
>>>>> compilation errors.
>>>>>
>>>>> We know that this can be controlled by writeDynamic but since we
>>>>> believe the default behavior for the connector should be to output a
>>>>> Document to one or more files (depending on sharding) we were wondering
>>>>> how to best accomplish this.
>>>>>
>>>>> Best,
>>>>> Chris
>>>>>
>>>>> *This message contains information that may be privileged or
>>>>> confidential and is the property of the Quantiphi Inc and/or its
>>>>> affiliates**. It is intended only for the person to whom it is
>>>>> addressed. **If you are not the intended recipient, any review,
>>>>> dissemination, distribution, copying, storage or other use of all or
>>>>> any portion of this message is strictly prohibited. If you received
>>>>> this message in error, please immediately notify the sender by reply
>>>>> e-mail and delete this message in its **entirety*
>
> --
> *Regards,*
>
> *Chris Larsen*
> Data Engineer | Quantiphi Inc. | US and India
> http://www.quantiphi.com | Analytics is in our DNA
> USA: +1 760 504 8477