Re: Per Element File Output Without writeDynamic
Hi Christopher,

Thanks for clarifying. In that case, can you preprocess the PCollection with a custom FlatMapElements that converts each Document into one or more smaller Documents, each small enough to be written to its own file? Then pair each resulting Document with a unique key and follow it with FileIO.writeDynamic().by(the unique key).withNumShards(1) to produce one file per Document.

On Tue, Dec 3, 2019 at 7:55 AM Christopher Larsen <christopher.lar...@quantiphi.com> wrote:
> Hi Eugene,
>
> Yes I think you've got it correct. In our use case we need to write each
> Document in the PCollection to a separate file as multiple Documents in a
> file will cause compilation errors and/or incorrect code to be generated by
> the Thrift compiler.
>
> Additionally there are some Documents that are so large that we would want
> them to be split.
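The suggestion above (split each Document, give every part a unique key, write one file per key) can be modeled without Beam. The following is a minimal sketch that runs with only the JDK; the class name `PerDocumentKeys`, the method `expand`, and the key scheme `<docId>-part-<index>` are assumptions made for illustration and do not come from the thread. In a real pipeline the loop body would live inside the FlatMapElements step, and the generated key would be what `writeDynamic().by(...)` groups on.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class PerDocumentKeys {

    /**
     * Expands each (docId, content) entry into one or more parts and gives
     * every part its own key, so no two parts can share an output file.
     */
    static Map<String, String> expand(Map<String, String> documents,
                                      Function<String, List<String>> splitter) {
        Map<String, String> keyedParts = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            List<String> parts = splitter.apply(doc.getValue());
            for (int i = 0; i < parts.size(); i++) {
                // Hypothetical key scheme: "<docId>-part-<index>".
                keyedParts.put(doc.getKey() + "-part-" + i, parts.get(i));
            }
        }
        return keyedParts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("small", "struct A {}");
        docs.put("large", "struct B {}\n\nstruct C {}");

        // Placeholder splitter that breaks on blank lines; it is not Thrift-aware.
        Map<String, String> out =
            expand(docs, d -> Arrays.asList(d.split("\n\n")));

        // Each key stands for exactly one output file.
        out.forEach((key, body) -> System.out.println(key + ".thrift <- " + body));
    }
}
```

Because the key is unique per part, fixing the shard count at one per key is what keeps the runner from combining two parts in a single file.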
Re: Per Element File Output Without writeDynamic
Hi Eugene,

Yes, I think you've got it correct. In our use case we need to write each Document in the PCollection to a separate file, as multiple Documents in a file will cause compilation errors and/or incorrect code to be generated by the Thrift compiler.

Additionally, there are some Documents that are so large that we would want them to be split.

On Mon, Dec 2, 2019 at 9:45 PM Eugene Kirpichov wrote:
> Hi Christopher,
>
> So, you have a PCollection, and you're writing it to files.
> FileIO.write/writeDynamic will write several Documents to each file;
> however, in your use case some of the individual Documents are so large
> that you want instead each of those large documents to be split into
> several files.
>
> Before we continue, could you confirm whether my understanding is correct?
>
> Thanks.
Re: Per Element File Output Without writeDynamic
Hi Christopher,

So, you have a PCollection, and you're writing it to files. FileIO.write/writeDynamic will write several Documents to each file; however, in your use case some of the individual Documents are so large that you instead want each of those large Documents to be split into several files.

Before we continue, could you confirm whether my understanding is correct?

Thanks.

On Mon, Dec 2, 2019 at 7:08 PM Christopher Larsen <christopher.lar...@quantiphi.com> wrote:
> Ideally each element (document) will be written to a .thrift file so that
> it can be compiled without further manipulation.
>
> But in the case of an extremely large file I think it would be nice to
> split into smaller files. As far as splitting points go I think it could be
> split at a point in the list of definitions. Thoughts?
Re: Per Element File Output Without writeDynamic
Ideally each element (document) will be written to a .thrift file so that it can be compiled without further manipulation.

But in the case of an extremely large file I think it would be nice to split it into smaller files. As far as splitting points go, I think it could be split at a point in the list of definitions. Thoughts?

On Mon, Dec 2, 2019 at 4:02 PM Reuven Lax wrote:
> What do you mean by shard the output file? Can it be split at any byte
> location, or only at specific points?

--
Regards,
Chris Larsen
Data Engineer | Quantiphi Inc. | US and India
http://www.quantiphi.com | Analytics is in our DNA
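Splitting "at a point in the list of definitions" could look roughly like the sketch below. It assumes a heavily simplified view of a .thrift file: everything before the first definition is header (namespace and include lines), and a definition starts on any line that begins at column 0 with one of a small, assumed keyword list. The class `ThriftSplitter`, its `split` method, and the `maxDefinitionsPerPart` parameter are hypothetical. The header is repeated at the top of every part so that each part keeps the required ordering, but definitions that refer to each other across parts would still need include statements, which this sketch does not generate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ThriftSplitter {
    // Assumed subset of keywords that open a top-level definition.
    private static final List<String> DEFINITION_KEYWORDS = Arrays.asList(
        "struct", "enum", "service", "exception", "union", "typedef", "const");

    private static boolean startsDefinition(String line) {
        for (String keyword : DEFINITION_KEYWORDS) {
            if (line.startsWith(keyword + " ")) {
                return true;
            }
        }
        return false;
    }

    /** Splits a document into parts holding at most maxDefinitionsPerPart definitions each. */
    static List<String> split(String document, int maxDefinitionsPerPart) {
        if (maxDefinitionsPerPart < 1) {
            throw new IllegalArgumentException("maxDefinitionsPerPart must be >= 1");
        }
        List<String> header = new ArrayList<>();
        List<List<String>> definitions = new ArrayList<>();
        for (String line : document.split("\n")) {
            if (startsDefinition(line)) {
                definitions.add(new ArrayList<>());
            }
            if (definitions.isEmpty()) {
                header.add(line);                                   // still in the header
            } else {
                definitions.get(definitions.size() - 1).add(line);  // body of current definition
            }
        }

        List<String> parts = new ArrayList<>();
        for (int start = 0; start < definitions.size(); start += maxDefinitionsPerPart) {
            int end = Math.min(start + maxDefinitionsPerPart, definitions.size());
            List<String> lines = new ArrayList<>(header);           // repeat header in every part
            for (List<String> definition : definitions.subList(start, end)) {
                lines.addAll(definition);
            }
            parts.add(String.join("\n", lines));
        }
        if (parts.isEmpty()) {
            parts.add(document);                                    // no definitions: one part
        }
        return parts;
    }

    public static void main(String[] args) {
        String doc = "namespace java demo\nstruct A {}\nenum B {}\nstruct C {}";
        List<String> parts = split(doc, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("part " + i + ":\n" + parts.get(i) + "\n");
        }
    }
}
```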
Re: Per Element File Output Without writeDynamic
What do you mean by shard the output file? Can it be split at any byte location, or only at specific points?

On Mon, Dec 2, 2019 at 2:05 PM Christopher Larsen <christopher.lar...@quantiphi.com> wrote:
> Hi Reuven,
>
> We would like to write each element to one file but still allow the runner
> to shard the output file which could yield more than one output file per
> element.
Re: Per Element File Output Without writeDynamic
Hi Reuven,

We would like to write each element to one file but still allow the runner to shard the output file, which could yield more than one output file per element.

On Mon, Dec 2, 2019 at 11:55 AM Reuven Lax wrote:
> I'm not sure I completely understand the question. Are you saying that you
> want each element to write to only one file, guaranteeing that two elements
> are never written to the same file?
Re: Per Element File Output Without writeDynamic
I'm not sure I completely understand the question. Are you saying that you want each element to write to only one file, guaranteeing that two elements are never written to the same file?

On Mon, Dec 2, 2019 at 11:53 AM Christopher Larsen <christopher.lar...@quantiphi.com> wrote:
> Hi All,
>
> TL/DR: can you extend FileIO.sink to write one or more file per element
> instead of one or more elements per file?
>
> In working with Thrift files we have found that since a .thrift file needs
> to be compiled to generate code the order of the contents of the file are
> important (ie, the namespace and includes elements need to come before
> definitions are defined).
>
> The issue that we are facing is that by implementing FileIO.sink
> we cannot determine how many Document objects are written to a file since
> this is determined by the runner. This can result in more than one Document
> being written to a file which will cause compilation errors.
>
> We know that this can be controlled by writeDynamic but since we believe
> the default behavior for the connector should be to output a Document to
> one or more files (depending on sharding) we were wondering how to best
> accomplish this.
>
> Best,
> Chris
Per Element File Output Without writeDynamic
Hi All,

TL/DR: can you extend FileIO.Sink to write one or more files per element instead of one or more elements per file?

In working with Thrift files we have found that, since a .thrift file needs to be compiled to generate code, the order of the contents of the file is important (i.e., the namespace and include elements need to come before any definitions).

The issue that we are facing is that by implementing FileIO.Sink we cannot determine how many Document objects are written to a file, since this is determined by the runner. This can result in more than one Document being written to a file, which will cause compilation errors.

We know that this can be controlled by writeDynamic, but since we believe the default behavior for the connector should be to output a Document to one or more files (depending on sharding), we were wondering how best to accomplish this.

Best,
Chris

--
This message contains information that may be privileged or confidential and is the property of the Quantiphi Inc and/or its affiliates. It is intended only for the person to whom it is addressed. If you are not the intended recipient, any review, dissemination, distribution, copying, storage or other use of all or any portion of this message is strictly prohibited. If you received this message in error, please immediately notify the sender by reply e-mail and delete this message in its entirety.
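To make the ordering problem described above concrete, here is a small sketch of a check that flags a file in which a header line appears after a definition, which is what happens when two Documents end up in the same file. The class `ThriftOrderCheck`, the method `headersPrecedeDefinitions`, and both keyword lists are assumptions for illustration; the check looks only at line prefixes and is far simpler than what the Thrift compiler does.

```java
import java.util.Arrays;
import java.util.List;

final class ThriftOrderCheck {
    // Assumed prefixes; a simplification of the real Thrift grammar.
    private static final List<String> HEADER_KEYWORDS =
        Arrays.asList("namespace", "include");
    private static final List<String> DEFINITION_KEYWORDS = Arrays.asList(
        "struct", "enum", "service", "exception", "union", "typedef", "const");

    private static boolean startsWithAny(String line, List<String> keywords) {
        for (String keyword : keywords) {
            if (line.startsWith(keyword + " ")) {
                return true;
            }
        }
        return false;
    }

    /** Returns true when every header line comes before the first definition. */
    static boolean headersPrecedeDefinitions(String fileContents) {
        boolean seenDefinition = false;
        for (String line : fileContents.split("\n")) {
            if (startsWithAny(line, DEFINITION_KEYWORDS)) {
                seenDefinition = true;
            } else if (seenDefinition && startsWithAny(line, HEADER_KEYWORDS)) {
                return false;  // header after a definition: ordering is broken
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String docA = "namespace java a\nstruct A {}";
        String docB = "namespace java b\nstruct B {}";

        // One Document per file keeps the header first.
        System.out.println(headersPrecedeDefinitions(docA));               // prints "true"
        // Two Documents written to the same file put a namespace after a struct.
        System.out.println(headersPrecedeDefinitions(docA + "\n" + docB)); // prints "false"
    }
}
```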