Re: Per Element File Output Without writeDynamic

2019-12-03 Thread Eugene Kirpichov
Hi Christopher,

Thanks for clarifying. Then can you just preprocess the PCollection with a
custom FlatMapElements that converts each Document into one or more smaller
Documents, each small enough to be written to an individual file? Then pair
each one with a unique key and follow with FileIO.writeDynamic().by(the
unique key).withNumShards(1) to produce one file per document.
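
A minimal sketch of the two preprocessing steps, written in plain Java
rather than as Beam transforms so that it runs on its own. The class name
PerDocumentFiles, the method keyPieces, and the key format are hypothetical,
and a Document is modelled as a String. In a real pipeline the loop body
would presumably live inside the FlatMapElements, and the key would have to
come from something that is unique across workers (an id carried by the
Document, or a UUID) rather than from a list index.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for "flat-map into smaller documents, then key each piece". */
final class PerDocumentFiles {
    private PerDocumentFiles() {}

    /**
     * Splits every document with the supplied splitter and pairs each piece
     * with a unique file name, so that one key corresponds to exactly one file.
     */
    static Map<String, String> keyPieces(
            List<String> documents, Function<String, List<String>> splitter) {
        Map<String, String> filesByKey = new LinkedHashMap<>();
        for (int d = 0; d < documents.size(); d++) {
            List<String> pieces = splitter.apply(documents.get(d));
            for (int p = 0; p < pieces.size(); p++) {
                // Assumption: a list index is unique enough for this local sketch.
                String key = "document-" + d + "-part-" + p + ".thrift";
                filesByKey.put(key, pieces.get(p));
            }
        }
        return filesByKey;
    }

    public static void main(String[] args) {
        // Toy splitter: one piece per line. A real splitter would cut at
        // definition boundaries and repeat the namespace/include header.
        Map<String, String> files = keyPieces(
                Arrays.asList("first\nsecond", "third"),
                doc -> Arrays.asList(doc.split("\n")));
        files.forEach((name, body) -> System.out.println(name + " <- " + body));
    }
}
```

Each key in the returned map plays the role of the unique key passed to
.by(...): grouping by it and writing a single shard per key gives one file
per piece.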


Re: Per Element File Output Without writeDynamic

2019-12-03 Thread Christopher Larsen
Hi Eugene,

Yes, I think you've got it right. In our use case we need to write each
Document in the PCollection to a separate file, because multiple Documents
in one file will cause the Thrift compiler to report compilation errors
and/or generate incorrect code.

Additionally there are some Documents that are so large that we would want
them to be split.


Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Eugene Kirpichov
Hi Christopher,

So, you have a PCollection<Document>, and you're writing it to files.
FileIO.write/writeDynamic will write several Documents to each file;
however, in your use case some of the individual Documents are so large
that you instead want each of those large Documents to be split into
several files.

Before we continue, could you confirm whether my understanding is correct?

Thanks.



Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Christopher Larsen
Ideally each element (document) will be written to a .thrift file so that
it can be compiled without further manipulation.

But in the case of an extremely large file I think it would be nice to
split it into smaller files. As for splitting points, I think it could be
split at a point in the list of definitions. Thoughts?
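
To make the idea of splitting "at a point in the list of definitions"
concrete, here is a rough sketch in plain Java. ThriftSplitter, split, and
maxDefinitionsPerFile are hypothetical names, and the sketch rests on
simplifying assumptions: the document is a String, every top-level
definition starts at the beginning of a line with one of the Thrift
definition keywords, and everything before the first definition (namespace
and include lines) is the header. It does not track references between
definitions, so a piece may refer to a type that ended up in another piece.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits one Thrift IDL document between definitions. */
final class ThriftSplitter {
    private ThriftSplitter() {}

    // Keywords that open a top-level definition in the Thrift IDL.
    private static final String[] DEFINITION_KEYWORDS = {
        "const", "typedef", "enum", "senum", "struct", "union", "exception", "service"
    };

    /** Returns true if the line opens a new top-level definition (assumed to start at column 0). */
    static boolean startsDefinition(String line) {
        for (String keyword : DEFINITION_KEYWORDS) {
            if (line.startsWith(keyword + " ")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Splits the document into pieces holding at most maxDefinitionsPerFile
     * definitions each. The header is repeated at the top of every piece so
     * that namespace and include lines still come before the definitions.
     */
    static List<String> split(String document, int maxDefinitionsPerFile) {
        if (maxDefinitionsPerFile < 1) {
            throw new IllegalArgumentException("maxDefinitionsPerFile must be >= 1");
        }
        StringBuilder header = new StringBuilder();
        List<StringBuilder> definitions = new ArrayList<>();
        for (String line : document.split("\n")) {
            if (startsDefinition(line)) {
                definitions.add(new StringBuilder());
            }
            // Lines before the first definition belong to the header;
            // later lines belong to the most recently opened definition.
            StringBuilder target = definitions.isEmpty()
                    ? header
                    : definitions.get(definitions.size() - 1);
            target.append(line).append('\n');
        }
        List<String> pieces = new ArrayList<>();
        for (int i = 0; i < definitions.size(); i += maxDefinitionsPerFile) {
            StringBuilder piece = new StringBuilder(header);
            int end = Math.min(i + maxDefinitionsPerFile, definitions.size());
            for (int j = i; j < end; j++) {
                piece.append(definitions.get(j));
            }
            pieces.add(piece.toString());
        }
        if (pieces.isEmpty()) {
            pieces.add(header.toString()); // no definitions found: keep the document whole
        }
        return pieces;
    }
}
```

Calling ThriftSplitter.split(document, 2) on a document with a header and
three definitions would yield two pieces, each beginning with the same
header.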

*Regards,*

___

*Chris Larsen*

Data Engineer | Quantiphi Inc. | US and India

http://www.quantiphi.com | Analytics is in our DNA

USA: +1 760 504 8477


-- 
_This message contains information that may be privileged or confidential 
and is the property of the Quantiphi Inc and/or its affiliates_. It is 
intended only for the person to whom it is addressed. _If you are not the 
intended recipient, any review, dissemination, distribution, copying, 
storage or other use of all or any portion of this message is strictly 
prohibited. If you received this message in error, please immediately 
notify the sender by reply e-mail and delete this message in its 
*entirety*___


Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Reuven Lax
What do you mean by shard the output file? Can it be split at any byte
location, or only at specific points?



Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Christopher Larsen
Hi Reuven,

We would like to write each element to one file but still allow the runner
to shard the output file, which could yield more than one output file per
element.



Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Reuven Lax
I'm not sure I completely understand the question. Are you saying that you
want each element to write to only one file, guaranteeing that two elements
are never written to the same file?



Per Element File Output Without writeDynamic

2019-12-02 Thread Christopher Larsen
Hi All,

TL;DR: can you extend FileIO.Sink to write one or more files per element
instead of one or more elements per file?

In working with Thrift files we have found that, since a .thrift file needs
to be compiled to generate code, the order of the contents of the file is
important (i.e., the namespace and includes elements need to come before
any definitions).

The issue that we are facing is that by implementing FileIO.Sink<Document>
we cannot determine how many Document objects are written to a file, since
this is determined by the runner. This can result in more than one Document
being written to a file, which will cause compilation errors.

We know that this can be controlled by writeDynamic, but since we believe
the default behavior for the connector should be to output a Document to
one or more files (depending on sharding), we were wondering how best to
accomplish this.

Best,
Chris
