Re: rename: BeamRecord -> Row

2018-02-02 Thread Romain Manni-Bucau
Hi

Shouldn't the discussion on schemas, which has a direct impact on this
generic container, be concluded before any action on this?




Re: rename: BeamRecord -> Row

2018-02-02 Thread Ankur Chauhan
++



Re: rename: BeamRecord -> Row

2018-02-02 Thread Rafael Fernandez
Very strong +1






rename: BeamRecord -> Row

2018-02-02 Thread Reuven Lax
We're looking at renaming the BeamRecord class, which is used for columnar
data. There was enough discussion about the naming that I want to make sure
the dev list is aware of the naming plans here.

BeamRecord is a columnar, field-based record. Currently it's used by
BeamSQL, and the plan is to use it for schemas as well. "Record" is a
confusing name for this class, as all elements in the Beam model are
referred to as "records," whether or not they have schemas. "Row" is a much
clearer name.

There was a lot of discussion about whether to name this BeamRow or just plain
Row (in the org.apache.beam.values namespace). The argument in favor of
BeamRow was so that people aren't forced to qualify their type names in the
case of a conflict with a Row from another package. The argument in favor
of Row was that it's a better name, it's in the Beam namespace anyway, and
it's what the rest of the world (Cassandra, Hive, Spark, etc.) calls
similar classes.

Right now, consensus on the PR is leaning toward Row. If you feel strongly,
please speak up :)

Reuven


Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Robert Bradshaw
Pipeline stages must be retry-tolerant. E.g. the VM it's running on
might get shut down. We should not be failing jobs in this case.

It seems the current implementation could only produce bad results if
(1) unrelated output files already existed and (2) the temporary files
were either not written or deleted out-of-band. (2) seems really
unlikely, but can be eliminated if we ensure that (1) cannot happen
(e.g. deleting destination files before starting the rename).

On Fri, Feb 2, 2018 at 12:13 PM, Reuven Lax  wrote:
>
>
> On Fri, Feb 2, 2018 at 11:17 AM, Chamikara Jayalath 
> wrote:
>>
>> Currently, Python file-based sink is batch only.
>
>
> Sure, but that won't be true forever.
>
>>
>>
>> Regarding Raghu's question, stage/pipeline failure should not be
>> considered as a data loss but I prefer overriding existing output and
>> completing a possibly expensive pipeline over failing the whole pipeline due
>> to one or more existing files.
>>
>> - Cham
>>
>>
>> On Fri, Feb 2, 2018 at 10:21 AM Reuven Lax  wrote:
>>>
>>> However this code might run in streaming as well, right?
>>>
>>> On Fri, Feb 2, 2018 at 9:54 AM, Raghu Angadi  wrote:

 In a batch pipeline, is it considered a data loss if the stage fails
 (assuming it does not set IGNORE_MISSING_FILES and fails hard)? If not, it
 might be better to favor correctness and fail in current implementation.


 On Thu, Feb 1, 2018 at 4:07 PM, Robert Bradshaw 
 wrote:
>
> You could add a step to delete all of dest before a barrier and the
> step that does the rename as outlined. In that case, any dest file
> that exists must be good.
>
> On Thu, Feb 1, 2018 at 2:52 PM, Eugene Kirpichov 
> wrote:
> > I think this is still unsafe in case exists(dst) (e.g. this is a
> > re-run of a
> > pipeline) but src is missing due to some bad reason. However it's
> > probably
> > better than what we have (e.g. we currently certainly don't perform
> > checksum
> > checks).
> >
> > On Thu, Feb 1, 2018 at 2:45 PM Udi Meiri  wrote:
> >>
> >> For GCS, I would do what I believe we already do.
> >> rename(src, dst):
> >> - if !exists(src) and exists(dst) return 0
> >> - if !exists(src) and !exists(dst) return error
> >> - if exists(src) and exists(dst) { if checksum(src) == checksum(dst)
> >> return 0 else delete(dst) }
> >> - Start a GCS copy from src to dst.
> >> - Wait for GCS copy to complete.
> >> - delete(src)
> >>
> >> For filesystems that don't have checksum() metadata, size() can be
> >> used
> >> instead.
> >>
> >> I've opened a bug to track this:
> >> https://issues.apache.org/jira/browse/BEAM-3600
> >>
> >> On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov
> >> 
> >> wrote:
> >>>
> >>> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore
> >>> files
> >>> that are missing for more ominous reasons than just this being a
> >>> non-first
> >>> attempt at renaming src to dst. E.g. if there was a bug in
> >>> constructing the
> >>> filename to be renamed, or if we somehow messed up the order of
> >>> rename vs
> >>> cleanup, etc. - these situations with IGNORE_MISSING_FILES would
> >>> lead to
> >>> silent data loss (likely caught by unit tests though - so this is
> >>> not a
> >>> super serious issue).
> >>>
> >>> Basically I just can't think of a case when I was copying files and
> >>> thinking "oh man, I wish it didn't give an error if the stuff I'm
> >>> copying
> >>> doesn't exist" - the option exists only because we couldn't come up
> >>> with
> >>> another way to implement idempotent rename on GCS.
> >>>
> >>> What's your idea of how a safe retryable GCS rename() could work?
> >>>
> >>> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:
> 
>  Eugene, if I get this right, you're saying that
>  IGNORE_MISSING_FILES is
>  unsafe because it will skip (src, dst) pairs where neither exist?
>  (it only
>  looks if src exists)
> 
>  For GCS, we can construct a safe retryable rename() operation,
>  assuming
>  that copy() and delete() are atomic for a single file or pair of
>  files.
> 
> 
> 
>  On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi 
>  wrote:
> >
> > On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov
> >  wrote:
> >>
> >> As far as I know, the current implementation of file sinks is
> >> the only
> >> reason why the flag IGNORE_MISSING for copying even exists -
> >> there's no

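Udi's rename(src, dst) pseudocode quoted above can be sketched in plain
Python. This is an illustrative local-filesystem stand-in, not Beam's
FileSystems API: checksum() here hashes file contents, standing in for the
checksum metadata GCS exposes (size() would be the fallback elsewhere).

```python
import hashlib
import os
import shutil
import tempfile

def checksum(path):
    # Stand-in for filesystem checksum metadata (GCS exposes crc32c/md5);
    # filesystems without it could compare size() instead.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def idempotent_rename(src, dst):
    # Mirrors the quoted pseudocode, case by case.
    if not os.path.exists(src):
        if os.path.exists(dst):
            return  # a previous attempt already completed the rename
        raise FileNotFoundError("neither src nor dst exists")
    if os.path.exists(dst):
        if checksum(src) == checksum(dst):
            return  # copy already succeeded; nothing more to do
        os.remove(dst)  # dst is stale output from some earlier run
    shutil.copyfile(src, dst)  # "start a copy ... wait for it to complete"
    os.remove(src)             # delete src only once dst is known-good

# A retried call must be a harmless no-op.
d = tempfile.mkdtemp()
src, dst = os.path.join(d, "src"), os.path.join(d, "dst")
with open(src, "w") as f:
    f.write("data")
idempotent_rename(src, dst)
idempotent_rename(src, dst)  # simulated retry after success
with open(dst) as f:
    result = f.read()
```

As Eugene notes in the thread, this is still unsafe when dst exists but src
never did; the sketch inherits that limitation.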
Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Reuven Lax
On Fri, Feb 2, 2018 at 11:17 AM, Chamikara Jayalath 
wrote:

> Currently, Python file-based sink is batch only.
>

Sure, but that won't be true forever.


>
> Regarding Raghu's question, stage/pipeline failure should not be
> considered as a data loss but I prefer overriding existing output and
> completing a possibly expensive pipeline over failing the whole pipeline
> due to one or more existing files.
>
> - Cham
>
>
> On Fri, Feb 2, 2018 at 10:21 AM Reuven Lax  wrote:
>
>> However this code might run in streaming as well, right?
>>
>> On Fri, Feb 2, 2018 at 9:54 AM, Raghu Angadi  wrote:
>>
>>> In a batch pipeline, is it considered a data loss if the stage fails
>>> (assuming it does not set IGNORE_MISSING_FILES and fails hard)? If not, it
>>> might be better to favor correctness and fail in current implementation.
>>>
>>>
>>> On Thu, Feb 1, 2018 at 4:07 PM, Robert Bradshaw 
>>> wrote:
>>>
 You could add a step to delete all of dest before a barrier and the
 step that does the rename as outlined. In that case, any dest file
 that exists must be good.

 On Thu, Feb 1, 2018 at 2:52 PM, Eugene Kirpichov 
 wrote:
 > I think this is still unsafe in case exists(dst) (e.g. this is a
 re-run of a
 > pipeline) but src is missing due to some bad reason. However it's
 probably
 > better than what we have (e.g. we currently certainly don't perform
 checksum
 > checks).
 >
 > On Thu, Feb 1, 2018 at 2:45 PM Udi Meiri  wrote:
 >>
 >> For GCS, I would do what I believe we already do.
 >> rename(src, dst):
 >> - if !exists(src) and exists(dst) return 0
 >> - if !exists(src) and !exists(dst) return error
 >> - if exists(src) and exists(dst) { if checksum(src) == checksum(dst)
 >> return 0 else delete(dst) }
 >> - Start a GCS copy from src to dst.
 >> - Wait for GCS copy to complete.
 >> - delete(src)
 >>
 >> For filesystems that don't have checksum() metadata, size() can be
 used
 >> instead.
 >>
 >> I've opened a bug to track this:
 >> https://issues.apache.org/jira/browse/BEAM-3600
 >>
 >> On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov <
 kirpic...@google.com>
 >> wrote:
 >>>
 >>> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore
 files
 >>> that are missing for more ominous reasons than just this being a
 non-first
 >>> attempt at renaming src to dst. E.g. if there was a bug in
 constructing the
 >>> filename to be renamed, or if we somehow messed up the order of
 rename vs
 >>> cleanup, etc. - these situations with IGNORE_MISSING_FILES would
 lead to
 >>> silent data loss (likely caught by unit tests though - so this is
 not a
 >>> super serious issue).
 >>>
 >>> Basically I just can't think of a case when I was copying files and
 >>> thinking "oh man, I wish it didn't give an error if the stuff I'm
 copying
 >>> doesn't exist" - the option exists only because we couldn't come up
 with
 >>> another way to implement idempotent rename on GCS.
 >>>
 >>> What's your idea of how a safe retryable GCS rename() could work?
 >>>
 >>> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:
 
  Eugene, if I get this right, you're saying that
 IGNORE_MISSING_FILES is
  unsafe because it will skip (src, dst) pairs where neither exist?
 (it only
  looks if src exists)
 
  For GCS, we can construct a safe retryable rename() operation,
 assuming
  that copy() and delete() are atomic for a single file or pair of
 files.
 
 
 
  On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi 
 wrote:
 >
 > On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov
 >  wrote:
 >>
 >> As far as I know, the current implementation of file sinks is
 the only
 >> reason why the flag IGNORE_MISSING for copying even exists -
 there's no
 >> other compelling reason to justify it. We implement "rename" as
 "copy, then
 >> delete" (in a single DoFn), so for idempotency of this operation
 we need to
 >> ignore the copying of a non-existent file.
 >>
 >> I think the right way to go would be to change the
 implementation of
 >> renaming to have a @RequiresStableInput (or reshuffle) in the
 middle, so
 >> it's made of 2 individually idempotent operations:
 >> 1) copy, which would fail if input is missing, and would
 overwrite
 >> output if it exists
 >> -- reshuffle --
 >> 2) delete, which would not fail if input is missing.
 >
 >
 > Something like this 
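Eugene's two-stage proposal quoted above can be sketched as two individually
idempotent functions; in Beam these would be separate DoFns with a reshuffle
(or @RequiresStableInput) barrier between them, so each stage can be retried
independently. A plain-Python sketch under those assumptions:

```python
import os
import shutil
import tempfile

def copy_stage(src, dst):
    # Stage 1: fail if src is missing; overwrite dst if it exists.
    # Retrying re-copies the same bytes, so the result is unchanged.
    shutil.copyfile(src, dst)

# -- a barrier (reshuffle / @RequiresStableInput) would sit here --

def delete_stage(src):
    # Stage 2: must not fail if src is already gone, because a retry
    # may run after a previous attempt already deleted it.
    try:
        os.remove(src)
    except FileNotFoundError:
        pass

d = tempfile.mkdtemp()
src, dst = os.path.join(d, "part-00000.tmp"), os.path.join(d, "part-00000")
with open(src, "w") as f:
    f.write("contents")
copy_stage(src, dst)
copy_stage(src, dst)    # retried stage 1: same outcome
delete_stage(src)
delete_stage(src)       # retried stage 2: no error
with open(dst) as f:
    result = f.read()
```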

[SQL] Fix for HOP windows

2018-02-02 Thread Anton Kedin
Hi,

If you're not using Beam SQL's HOP windowing functions, you're not affected.


*The problem*

Calcite defines the HOP windowing function like this:
  - HOP(timestamp_field, frequency_interval, window_size)

Beam SQL implemented it incorrectly, swapping frequency_interval
and window_size:
  - HOP(timestamp_field, window_size, frequency_interval)
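To see why the argument order matters, here is a simplified model of HOP
(sliding-window) assignment; this is illustrative Python, not Beam SQL's
implementation:

```python
def hop_windows(timestamp, period, size):
    # A new window of length `size` starts every `period` time units;
    # an element belongs to every window that covers its timestamp.
    last_start = timestamp - timestamp % period
    starts = range(last_start - size + period, last_start + period, period)
    return [(s, s + size) for s in starts if s <= timestamp < s + size]

# HOP with period=10, size=30: the element at t=25 lands in 3 windows.
correct = hop_windows(25, 10, 30)   # [(0, 30), (10, 40), (20, 50)]
# With the arguments swapped (the bug): 10-unit windows every 30 units,
# i.e. a single window and gaps in coverage.
swapped = hop_windows(25, 30, 10)   # [(20, 30)]
```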


*The fix*

*The commit containing the fix, swapping the parameters to match the
Calcite docs, has just been pushed; see BEAM-3610.*


Regards,
Anton


Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Raghu Angadi
On Fri, Feb 2, 2018 at 10:21 AM, Reuven Lax  wrote:

> However this code might run in streaming as well, right?
>

True. What is the best-practice recommendation for handling this? Probably
the author of the sink transform should consider the context and decide
whether it needs to be retry-tolerant when configuring the transform. I
think the current behavior of not overwriting the output would be very
surprising to unsuspecting users.


>
> On Fri, Feb 2, 2018 at 9:54 AM, Raghu Angadi  wrote:
>
>> In a batch pipeline, is it considered a data loss if the stage fails
>> (assuming it does not set IGNORE_MISSING_FILES and fails hard)? If not, it
>> might be better to favor correctness and fail in current implementation.
>>
>>
>> On Thu, Feb 1, 2018 at 4:07 PM, Robert Bradshaw 
>> wrote:
>>
>>> You could add a step to delete all of dest before a barrier and the
>>> step that does the rename as outlined. In that case, any dest file
>>> that exists must be good.
>>>
>>> On Thu, Feb 1, 2018 at 2:52 PM, Eugene Kirpichov 
>>> wrote:
>>> > I think this is still unsafe in case exists(dst) (e.g. this is a
>>> re-run of a
>>> > pipeline) but src is missing due to some bad reason. However it's
>>> probably
>>> > better than what we have (e.g. we currently certainly don't perform
>>> checksum
>>> > checks).
>>> >
>>> > On Thu, Feb 1, 2018 at 2:45 PM Udi Meiri  wrote:
>>> >>
>>> >> For GCS, I would do what I believe we already do.
>>> >> rename(src, dst):
>>> >> - if !exists(src) and exists(dst) return 0
>>> >> - if !exists(src) and !exists(dst) return error
>>> >> - if exists(src) and exists(dst) { if checksum(src) == checksum(dst)
>>> >> return 0 else delete(dst) }
>>> >> - Start a GCS copy from src to dst.
>>> >> - Wait for GCS copy to complete.
>>> >> - delete(src)
>>> >>
>>> >> For filesystems that don't have checksum() metadata, size() can be
>>> used
>>> >> instead.
>>> >>
>>> >> I've opened a bug to track this:
>>> >> https://issues.apache.org/jira/browse/BEAM-3600
>>> >>
>>> >> On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov >> >
>>> >> wrote:
>>> >>>
>>> >>> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore
>>> files
>>> >>> that are missing for more ominous reasons than just this being a
>>> non-first
>>> >>> attempt at renaming src to dst. E.g. if there was a bug in
>>> constructing the
>>> >>> filename to be renamed, or if we somehow messed up the order of
>>> rename vs
>>> >>> cleanup, etc. - these situations with IGNORE_MISSING_FILES would
>>> lead to
>>> >>> silent data loss (likely caught by unit tests though - so this is
>>> not a
>>> >>> super serious issue).
>>> >>>
>>> >>> Basically I just can't think of a case when I was copying files and
>>> >>> thinking "oh man, I wish it didn't give an error if the stuff I'm
>>> copying
>>> >>> doesn't exist" - the option exists only because we couldn't come up
>>> with
>>> >>> another way to implement idempotent rename on GCS.
>>> >>>
>>> >>> What's your idea of how a safe retryable GCS rename() could work?
>>> >>>
>>> >>> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:
>>> 
>>>  Eugene, if I get this right, you're saying that
>>> IGNORE_MISSING_FILES is
>>>  unsafe because it will skip (src, dst) pairs where neither exist?
>>> (it only
>>>  looks if src exists)
>>> 
>>>  For GCS, we can construct a safe retryable rename() operation,
>>> assuming
>>>  that copy() and delete() are atomic for a single file or pair of
>>> files.
>>> 
>>> 
>>> 
>>>  On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi 
>>> wrote:
>>> >
>>> > On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov
>>> >  wrote:
>>> >>
>>> >> As far as I know, the current implementation of file sinks is the
>>> only
>>> >> reason why the flag IGNORE_MISSING for copying even exists -
>>> there's no
>>> >> other compelling reason to justify it. We implement "rename" as
>>> "copy, then
>>> >> delete" (in a single DoFn), so for idempotency of this operation
>>> we need to
>>> >> ignore the copying of a non-existent file.
>>> >>
>>> >> I think the right way to go would be to change the implementation
>>> of
>>> >> renaming to have a @RequiresStableInput (or reshuffle) in the
>>> middle, so
>>> >> it's made of 2 individually idempotent operations:
>>> >> 1) copy, which would fail if input is missing, and would overwrite
>>> >> output if it exists
>>> >> -- reshuffle --
>>> >> 2) delete, which would not fail if input is missing.
>>> >
>>> >
>>> > Something like this is needed only in streaming, right?
>>> >
>>> > Raghu.
>>> >
>>> >>
>>> >> That way first everything is copied (possibly via multiple
>>> attempts),
>>> >> and then old files are deleted (possibly via multiple attempts).
>>> 

Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Chamikara Jayalath
Currently, Python file-based sink is batch only.

Regarding Raghu's question, stage/pipeline failure should not be considered
as a data loss but I prefer overriding existing output and completing a
possibly expensive pipeline over failing the whole pipeline due to one or
more existing files.

- Cham


Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

2018-02-02 Thread Kenneth Knowles
Awesome, nice!

On Fri, Feb 2, 2018 at 11:00 AM, Charles Chen  wrote:

> Thanks Kenn.  We already do the Runner API roundtripping (I believe Robert
> implemented this).  With this change, we would start doing exactly what
> you're suggesting, where we apply overrides to a post-deserialization
> pipeline.
>
> On Thu, Feb 1, 2018 at 6:45 PM Kenneth Knowles  wrote:
>
>> +1 for removing apply_*
>>
>> For the Java SDK, removing specialized intercepts was an important first
>> step towards the portability framework. I wonder if there is a way for the
>> Python SDK to leapfrog, taking advantage of some of the lessons that Java
>> learned a bit more painfully. Most pertinent I think is that if an SDK's
>> role is to construct a pipeline and ship the proto to a runner (service)
>> then overrides apply to a post-deserialization pipeline. The Java
>> DirectRunner does a proto round-trip to avoid accidentally depending on
>> things that are not really part of the pipeline. I would think this crisp
>> abstraction enforcement would add even more value to Python.
>>
>> Kenn
>>
>> On Thu, Feb 1, 2018 at 5:21 PM, Charles Chen  wrote:
>>
>>> In the Python DirectRunner, we currently use apply_* overrides to
>>> override the operation of the default .expand() operation for certain
>>> transforms. For example, GroupByKey has a special implementation in the
>>> DirectRunner, so we use an apply_* override hook to replace the
>>> implementation of GroupByKey.expand().
>>>
>>> However, this strategy has drawbacks. Because this override operation
>>> happens eagerly during graph construction, the pipeline graph is
>>> specialized and modified before a specific runner is bound to the
>>> pipeline's execution. This makes the pipeline graph non-portable and blocks
>>> full migration to using the Runner API pipeline representation in the
>>> DirectRunner.
>>>
>>> By contrast, the SDK's PTransformOverride mechanism allows the
>>> expression of matchers that operate on the unspecialized graph, replacing
>>> PTransforms as necessary to produce a DirectRunner-specialized pipeline
>>> graph for execution.
>>>
>>> I therefore propose to replace these eager apply_* overrides with
>>> PTransformOverrides that operate on the completely constructed graph.
>>>
>>> The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and
>>> I've prepared a candidate patch at https://github.com/apache/
>>> incubator-beam/pull/4529.
>>>
>>> Best,
>>> Charles
>>>
>>
>>
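The matcher-plus-replacement pattern described above can be illustrated with
a toy graph model; the class and function names below are invented for the
sketch and are not the actual apache_beam API:

```python
class Transform:
    def __init__(self, name):
        self.name = name

class Override:
    # Toy analogue of a PTransformOverride: a predicate over transforms
    # plus a factory producing the runner-specific replacement.
    def __init__(self, matches, replacement):
        self.matches = matches
        self.replacement = replacement

def apply_overrides(pipeline, overrides):
    # Runs over the *completely constructed* graph, so the original,
    # runner-independent pipeline stays portable and unmodified.
    specialized = []
    for t in pipeline:
        for ov in overrides:
            if ov.matches(t):
                t = ov.replacement(t)
                break
        specialized.append(t)
    return specialized

pipeline = [Transform("Read"), Transform("GroupByKey"), Transform("Write")]
overrides = [Override(lambda t: t.name == "GroupByKey",
                      lambda t: Transform("DirectGroupByKey"))]
names = [t.name for t in apply_overrides(pipeline, overrides)]
```

The key contrast with the eager apply_* hooks is that the input graph is left
untouched; specialization happens only when a runner is bound.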


Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

2018-02-02 Thread Ahmet Altay
+1 to this change.

Thank you Charles for improving the DirectRunner, sharing your progress and
seeking feedback. This change would allow us to migrate to a faster
DirectRunner for Python: a long-requested feature and an important part of
the first-use experience for new users trying out Beam.

Ahmet



Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

2018-02-02 Thread Charles Chen
Thanks Kenn.  We already do the Runner API roundtripping (I believe Robert
implemented this).  With this change, we would start doing exactly what
you're suggesting, where we apply overrides to a post-deserialization
pipeline.



Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Reuven Lax
However this code might run in streaming as well, right?

On Fri, Feb 2, 2018 at 9:54 AM, Raghu Angadi  wrote:

> In a batch pipeline, is it considered data loss if the stage fails
> (assuming it does not set IGNORE_MISSING_FILES and fails hard)? If not, it
> might be better to favor correctness and fail in current implementation.
>
>
> On Thu, Feb 1, 2018 at 4:07 PM, Robert Bradshaw 
> wrote:
>
>> You could add a step to delete all of dest before a barrier and the
>> step that does the rename as outlined. In that case, any dest file
>> that exists must be good.
>>
>> On Thu, Feb 1, 2018 at 2:52 PM, Eugene Kirpichov 
>> wrote:
>> > I think this is still unsafe in case exists(dst) (e.g. this is a re-run
>> of a
>> > pipeline) but src is missing due to some bad reason. However it's
>> probably
>> > better than what we have (e.g. we currently certainly don't perform
>> checksum
>> > checks).
>> >
>> > On Thu, Feb 1, 2018 at 2:45 PM Udi Meiri  wrote:
>> >>
>> >> For GCS, I would do what I believe we already do.
>> >> rename(src, dst):
>> >> - if !exists(src) and exists(dst) return 0
>> >> - if !exists(src) and !exists(dst) return error
>> >> - if exists(src) and exists(dst) { if checksum(src) == checksum(dst)
>> >> return 0 else delete(dst) }
>> >> - Start a GCS copy from src to dst.
>> >> - Wait for GCS copy to complete.
>> >> - delete(src)
>> >>
>> >> For filesystems that don't have checksum() metadata, size() can be used
>> >> instead.
>> >>
>> >> I've opened a bug to track this:
>> >> https://issues.apache.org/jira/browse/BEAM-3600
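The rename(src, dst) branches above can be made concrete against a toy in-memory filesystem (a map from path to contents, with content equality standing in for the checksum comparison). This is only a sketch of why each branch keeps a retried rename safe, not the actual FileSystems/GCS implementation, and it finishes the trailing delete(src) in the equal-checksum branch as a small variation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a retryable rename: each branch handles one possible state
// left behind by a previous, partially completed attempt.
public class RetryableRename {
  final Map<String, String> fs = new HashMap<>();

  // Returns true on success; throws if neither src nor dst exists.
  boolean rename(String src, String dst) {
    boolean srcExists = fs.containsKey(src);
    boolean dstExists = fs.containsKey(dst);
    if (!srcExists && dstExists) {
      return true; // a previous attempt already completed fully
    }
    if (!srcExists && !dstExists) {
      // Unlike IGNORE_MISSING_FILES, this surfaces real problems loudly.
      throw new IllegalStateException("missing source: " + src);
    }
    if (srcExists && dstExists) {
      if (fs.get(src).equals(fs.get(dst))) {
        fs.remove(src); // copy already done; just finish the delete
        return true;
      }
      fs.remove(dst);   // stale dst from an earlier partial attempt
    }
    fs.put(dst, fs.get(src)); // "copy"
    fs.remove(src);           // "delete"
    return true;
  }

  public static void main(String[] args) {
    RetryableRename r = new RetryableRename();
    r.fs.put("tmp/part-0", "data");
    r.rename("tmp/part-0", "out/part-0");
    r.rename("tmp/part-0", "out/part-0"); // a retry is a safe no-op
    System.out.println(r.fs.get("out/part-0"));
  }
}
```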
>> >>
>> >> On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov 
>> >> wrote:
>> >>>
>> >>> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore
>> files
>> >>> that are missing for more ominous reasons than just this being a
>> non-first
>> >>> attempt at renaming src to dst. E.g. if there was a bug in
>> constructing the
>> >>> filename to be renamed, or if we somehow messed up the order of
>> rename vs
>> >>> cleanup, etc. - these situations with IGNORE_MISSING_FILES would lead
>> to
>> >>> silent data loss (likely caught by unit tests though - so this is not
>> a
>> >>> super serious issue).
>> >>>
>> >>> Basically I just can't think of a case when I was copying files and
>> >>> thinking "oh man, I wish it didn't give an error if the stuff I'm
>> copying
>> >>> doesn't exist" - the option exists only because we couldn't come up
>> with
>> >>> another way to implement idempotent rename on GCS.
>> >>>
>> >>> What's your idea of how a safe retryable GCS rename() could work?
>> >>>
>> >>> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:
>> 
>>  Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES
>> is
>>  unsafe because it will skip (src, dst) pairs where neither exist?
>> (it only
>>  looks if src exists)
>> 
>>  For GCS, we can construct a safe retryable rename() operation,
>> assuming
>>  that copy() and delete() are atomic for a single file or pair of
>> files.
>> 
>> 
>> 
>>  On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi 
>> wrote:
>> >
>> > On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov
>> >  wrote:
>> >>
>> >> As far as I know, the current implementation of file sinks is the
>> only
>> >> reason why the flag IGNORE_MISSING for copying even exists -
>> there's no
>> >> other compelling reason to justify it. We implement "rename" as
>> "copy, then
>> >> delete" (in a single DoFn), so for idempodency of this operation
>> we need to
>> >> ignore the copying of a non-existent file.
>> >>
>> >> I think the right way to go would be to change the implementation
>> of
>> >> renaming to have a @RequiresStableInput (or reshuffle) in the
>> middle, so
>> >> it's made of 2 individually idempotent operations:
>> >> 1) copy, which would fail if input is missing, and would overwrite
>> >> output if it exists
>> >> -- reshuffle --
>> >> 2) delete, which would not fail if input is missing.
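A toy version of that two-phase split, with an in-memory map standing in for the filesystem, shows why either phase can be retried independently once a barrier separates them (illustrative only; method names are made up):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a two-phase rename: phase 1 (copy) and phase 2 (delete) are
// each idempotent on their own, so either can be retried any number of
// times, provided a barrier (reshuffle / @RequiresStableInput) guarantees
// phase 2 only starts after phase 1 has committed.
public class TwoPhaseRename {
  final Map<String, String> fs = new HashMap<>();

  // Phase 1: fails if src is missing, overwrites dst if present.
  void copyPhase(String src, String dst) {
    String data = fs.get(src);
    if (data == null) {
      throw new IllegalStateException("missing source: " + src);
    }
    fs.put(dst, data); // overwriting is fine: repeated copies converge
  }

  // Phase 2: deleting an already-deleted src is fine.
  void deletePhase(String src) {
    fs.remove(src);
  }

  public static void main(String[] args) {
    TwoPhaseRename r = new TwoPhaseRename();
    r.fs.put("tmp/f", "data");
    r.copyPhase("tmp/f", "out/f");
    r.copyPhase("tmp/f", "out/f"); // retry of phase 1: idempotent
    // --- barrier: phase 2 runs only after phase 1 committed ---
    r.deletePhase("tmp/f");
    r.deletePhase("tmp/f");        // retry of phase 2: idempotent
    System.out.println(r.fs.containsKey("out/f"));
  }
}
```

Note that without the barrier, a retried copy could run after the delete and fail on the missing source, which is exactly the case the original single-DoFn implementation papers over with IGNORE_MISSING.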
>> >
>> >
>> > Something like this is needed only in streaming, right?
>> >
>> > Raghu.
>> >
>> >>
>> >> That way first everything is copied (possibly via multiple
>> attempts),
>> >> and then old files are deleted (possibly via multiple attempts).
>> >>
>> >> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri 
>> wrote:
>> >>>
>> >>> I agree that overwriting is more in line with user expectations.
>> >>> I believe that the sink should not ignore errors from the
>> filesystem
>> >>> layer. Instead, the FileSystem API should be more well defined.
>> >>> Examples: rename() and copy() should overwrite existing files at
>> the
>> >>> destination, copy() should have an ignore_missing flag.
>> >>>
>> >>> On Wed, Jan 31, 

Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Raghu Angadi
In a batch pipeline, is it considered data loss if the stage fails
(assuming it does not set IGNORE_MISSING_FILES and fails hard)? If not, it
might be better to favor correctness and fail in current implementation.

On Thu, Feb 1, 2018 at 4:07 PM, Robert Bradshaw  wrote:

> You could add a step to delete all of dest before a barrier and the
> step that does the rename as outlined. In that case, any dest file
> that exists must be good.
>
> On Thu, Feb 1, 2018 at 2:52 PM, Eugene Kirpichov 
> wrote:
> > I think this is still unsafe in case exists(dst) (e.g. this is a re-run
> of a
> > pipeline) but src is missing due to some bad reason. However it's
> probably
> > better than what we have (e.g. we currently certainly don't perform
> checksum
> > checks).
> >
> > On Thu, Feb 1, 2018 at 2:45 PM Udi Meiri  wrote:
> >>
> >> For GCS, I would do what I believe we already do.
> >> rename(src, dst):
> >> - if !exists(src) and exists(dst) return 0
> >> - if !exists(src) and !exists(dst) return error
> >> - if exists(src) and exists(dst) { if checksum(src) == checksum(dst)
> >> return 0 else delete(dst) }
> >> - Start a GCS copy from src to dst.
> >> - Wait for GCS copy to complete.
> >> - delete(src)
> >>
> >> For filesystems that don't have checksum() metadata, size() can be used
> >> instead.
> >>
> >> I've opened a bug to track this:
> >> https://issues.apache.org/jira/browse/BEAM-3600
> >>
> >> On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov 
> >> wrote:
> >>>
> >>> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore
> files
> >>> that are missing for more ominous reasons than just this being a
> non-first
> >>> attempt at renaming src to dst. E.g. if there was a bug in
> constructing the
> >>> filename to be renamed, or if we somehow messed up the order of rename
> vs
> >>> cleanup, etc. - these situations with IGNORE_MISSING_FILES would lead
> to
> >>> silent data loss (likely caught by unit tests though - so this is not a
> >>> super serious issue).
> >>>
> >>> Basically I just can't think of a case when I was copying files and
> >>> thinking "oh man, I wish it didn't give an error if the stuff I'm
> copying
> >>> doesn't exist" - the option exists only because we couldn't come up
> with
> >>> another way to implement idempotent rename on GCS.
> >>>
> >>> What's your idea of how a safe retryable GCS rename() could work?
> >>>
> >>> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:
> 
>  Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES
> is
>  unsafe because it will skip (src, dst) pairs where neither exist? (it
> only
>  looks if src exists)
> 
>  For GCS, we can construct a safe retryable rename() operation,
> assuming
>  that copy() and delete() are atomic for a single file or pair of
> files.
> 
> 
> 
>  On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi 
> wrote:
> >
> > On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov
> >  wrote:
> >>
> >> As far as I know, the current implementation of file sinks is the
> only
> >> reason why the flag IGNORE_MISSING for copying even exists -
> there's no
> >> other compelling reason to justify it. We implement "rename" as
> "copy, then
> >> delete" (in a single DoFn), so for idempodency of this operation we
> need to
> >> ignore the copying of a non-existent file.
> >>
> >> I think the right way to go would be to change the implementation of
> >> renaming to have a @RequiresStableInput (or reshuffle) in the
> middle, so
> >> it's made of 2 individually idempotent operations:
> >> 1) copy, which would fail if input is missing, and would overwrite
> >> output if it exists
> >> -- reshuffle --
> >> 2) delete, which would not fail if input is missing.
> >
> >
> > Something like this is needed only in streaming, right?
> >
> > Raghu.
> >
> >>
> >> That way first everything is copied (possibly via multiple
> attempts),
> >> and then old files are deleted (possibly via multiple attempts).
> >>
> >> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
> >>>
> >>> I agree that overwriting is more in line with user expectations.
> >>> I believe that the sink should not ignore errors from the
> filesystem
> >>> layer. Instead, the FileSystem API should be more well defined.
> >>> Examples: rename() and copy() should overwrite existing files at
> the
> >>> destination, copy() should have an ignore_missing flag.
> >>>
> >>> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi 
> >>> wrote:
> 
>  Original mail mentions that output from second run of word_count
> is
>  ignored. That does not seem as safe as ignoring error from a
> second attempt
>  of a step. 

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Romain Manni-Bucau
well we can disagree on the code - it is fine ;), but the part of it that
beam needs is not huge and in any case it can be forked without requiring 10
classes - and if so we'll use another impl than the guava one ;). This is
the whole point.


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book


2018-02-02 17:24 GMT+01:00 Reuven Lax :

> TypeToken is not trivial. I've written code to do what TypeToken does
> before (figuring out generic ancestor types). It's actually somewhat
> tricky, and the code I wrote had subtle bugs in it; eventually we removed
> this code in favor of Guava's implementation :)
>
> On Fri, Feb 2, 2018 at 7:47 AM, Romain Manni-Bucau 
> wrote:
>
>> Yep, note I never said to reinvent the wheel, we can copy it from guava,
>> openwebbeans or any other impl. Point was more to avoid to depend on
>> something we don't own for that which is after all not that much code. I
>> also think we can limit it a lot to align it on what is supported by beam
>> (I'm thinking to coders here) but this can be another topic.
>>
>>
>> Romain Manni-Bucau
>>
>> 2018-02-02 16:33 GMT+01:00 Kenneth Knowles :
>>
>>> On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 Don't forget beam doesn't support much behind it (mainly only a few
 ParameterizedType due the usage code path) so it is mainly only about
 handling parameterized types and typevariables recursively. Not a lot of
 work. Main concern being it is in the API so using a shade as an API is
 never a good idea, in particular cause the shade can be broken and requires
 to setup clirr or things like that and when it breaks you are done and need
 to fork it anyway. Limiting the dependencies for an API - as beam is - is
 always saner even if it requires a small fork of code.

>>>
>>> The main thing that TypeToken is used for is capturing generics that are
>>> lost by Java reflection. It is a bit tricky, actually.
>>>
>>> Kenn
>>>
>>>
>>>

 Romain Manni-Bucau

 2018-02-02 15:49 GMT+01:00 Kenneth Knowles :

> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>>
>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles :
>>
>>> Another couple:
>>>
>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
>>> TypeToken
>>>
>>
>> Technically reflect Type is enough
>>
>
> If you want to try to remove TypeToken from underneath TypeDescriptor,
> I have no objections as long as you expand the test suite to cover all the
> functionality where we trust TypeToken's tests. Good luck :-)
>
> Kenn
>
>
>>
>>
>>>  - ImmutableList and friends and their builders are very widely used
>>> and IMO still add a lot for readability and preventing someone coming 
>>> along
>>> and adding mistakes to a codebase
>>>
>>
>> Sugar but not required. When you compare the cost of a shade versus
>> of duplicating the parts we need there is no real match IMHO.
>>
>>
>>>
>>> So considering it all, I would keep a vendored Guava (but also move
>>> off where we can, and also have our own improvements). I hope it will 
>>> be a
>>> near-empty build file to generate it so not a maintenance burden.
>>>
>>> Kenn
>>>
>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles 
>>> wrote:
>>>
 Nice. This sounds like a great idea (or two?) and goes along with
 what I just started for futures.

 Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
 assigned to Ismaël :-) and converted my futures thing to a subtask.

 Specific things for our micro guava:

  - checkArgumentNotNull can throw the right exception
  - our own Optional because Java's is not Serializable

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Reuven Lax
TypeToken is not trivial. I've written code to do what TypeToken does
before (figuring out generic ancestor types). It's actually somewhat
tricky, and the code I wrote had subtle bugs in it; eventually we removed
this code in favor of Guava's implementation :)

On Fri, Feb 2, 2018 at 7:47 AM, Romain Manni-Bucau 
wrote:

> Yep, note I never said to reinvent the wheel, we can copy it from guava,
> openwebbeans or any other impl. Point was more to avoid to depend on
> something we don't own for that which is after all not that much code. I
> also think we can limit it a lot to align it on what is supported by beam
> (I'm thinking to coders here) but this can be another topic.
>
>
> Romain Manni-Bucau
>
> 2018-02-02 16:33 GMT+01:00 Kenneth Knowles :
>
>> On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau > > wrote:
>>
>>> Don't forget beam doesn't support much behind it (mainly only a few
>>> ParameterizedType due the usage code path) so it is mainly only about
>>> handling parameterized types and typevariables recursively. Not a lot of
>>> work. Main concern being it is in the API so using a shade as an API is
>>> never a good idea, in particular cause the shade can be broken and requires
>>> to setup clirr or things like that and when it breaks you are done and need
>>> to fork it anyway. Limiting the dependencies for an API - as beam is - is
>>> always saner even if it requires a small fork of code.
>>>
>>
>> The main thing that TypeToken is used for is capturing generics that are
>> lost by Java reflection. It is a bit tricky, actually.
>>
>> Kenn
>>
>>
>>
>>>
>>> Romain Manni-Bucau
>>>
>>> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles :
>>>
 On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

>
>
> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles :
>
>> Another couple:
>>
>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
>> TypeToken
>>
>
> Technically reflect Type is enough
>

 If you want to try to remove TypeToken from underneath TypeDescriptor,
 I have no objections as long as you expand the test suite to cover all the
 functionality where we trust TypeToken's tests. Good luck :-)

 Kenn


>
>
>>  - ImmutableList and friends and their builders are very widely used
>> and IMO still add a lot for readability and preventing someone coming 
>> along
>> and adding mistakes to a codebase
>>
>
> Sugar but not required. When you compare the cost of a shade versus of
> duplicating the parts we need there is no real match IMHO.
>
>
>>
>> So considering it all, I would keep a vendored Guava (but also move
>> off where we can, and also have our own improvements). I hope it will be 
>> a
>> near-empty build file to generate it so not a maintenance burden.
>>
>> Kenn
>>
>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles 
>> wrote:
>>
>>> Nice. This sounds like a great idea (or two?) and goes along with
>>> what I just started for futures.
>>>
>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>
>>> Specific things for our micro guava:
>>>
>>>  - checkArgumentNotNull can throw the right exception
>>>  - our own Optional because Java's is not Serializable
>>>  - futures combinators since many are missing, especially Java's
>>> don't do exceptions right
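As one example of what such a "micro guava" class could look like, here is a minimal hypothetical Serializable Optional (java.util.Optional deliberately does not implement Serializable, which is the gap mentioned above). This is only a sketch, not an actual Beam type:

```java
import java.io.Serializable;
import java.util.NoSuchElementException;

// Hypothetical sketch of a Serializable Optional, so a present-or-absent
// value can be captured inside a serialized function without relying on
// java.util.Optional (which is intentionally not Serializable).
public final class SerializableOptional<T extends Serializable>
    implements Serializable {
  private static final long serialVersionUID = 1L;

  private final T value; // null encodes "absent"

  private SerializableOptional(T value) { this.value = value; }

  public static <T extends Serializable> SerializableOptional<T> of(T value) {
    if (value == null) {
      throw new NullPointerException("use absent() for a missing value");
    }
    return new SerializableOptional<>(value);
  }

  public static <T extends Serializable> SerializableOptional<T> absent() {
    return new SerializableOptional<>(null);
  }

  public boolean isPresent() { return value != null; }

  public T get() {
    if (value == null) {
      throw new NoSuchElementException("value is absent");
    }
    return value;
  }

  public static void main(String[] args) {
    SerializableOptional<String> o = SerializableOptional.of("hello");
    // Present, and usable anywhere a Serializable is required.
    System.out.println(o.isPresent() && o instanceof Serializable);
  }
}
```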
>>>
>>> Protobuf: didn't file an issue because I'm not sure
>>>
>>> I was wondering if pre-shading works. We really need to get rid of
>>> it from public APIs in a 100% reliable way. It is also a problem for
>>> Dataflow. I was wondering if one approach is to pre-shade
>>> gcpio-protobuf-java, gcpio-protobuf-java-util, gcpio-grpc-java, etc.
>>> Anything that needs to take a Message object. (and do the same for
>>> beam-model-protobuf-java since the model bits have to depend on each 
>>> other
>>> but nothing else).
>>>
>>> Kenn
>>>
>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía 

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Romain Manni-Bucau
Yep, note I never said to reinvent the wheel - we can copy it from guava,
openwebbeans or any other impl. The point was more to avoid depending on
something we don't own for what is, after all, not that much code. I also
think we can limit it a lot to align it with what beam supports (I'm
thinking of coders here), but this can be another topic.


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book


2018-02-02 16:33 GMT+01:00 Kenneth Knowles :

> On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau 
> wrote:
>
>> Don't forget beam doesn't support much behind it (mainly only a few
>> ParameterizedType due the usage code path) so it is mainly only about
>> handling parameterized types and typevariables recursively. Not a lot of
>> work. Main concern being it is in the API so using a shade as an API is
>> never a good idea, in particular cause the shade can be broken and requires
>> to setup clirr or things like that and when it breaks you are done and need
>> to fork it anyway. Limiting the dependencies for an API - as beam is - is
>> always saner even if it requires a small fork of code.
>>
>
> The main thing that TypeToken is used for is capturing generics that are
> lost by Java reflection. It is a bit tricky, actually.
>
> Kenn
>
>
>
>>
>> Romain Manni-Bucau
>>
>> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles :
>>
>>> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>


 2018-02-02 15:37 GMT+01:00 Kenneth Knowles :

> Another couple:
>
>  - User-facing TypeDescriptor is a very thin wrapper on Guava's
> TypeToken
>

 Technically reflect Type is enough

>>>
>>> If you want to try to remove TypeToken from underneath TypeDescriptor, I
>>> have no objections as long as you expand the test suite to cover all the
>>> functionality where we trust TypeToken's tests. Good luck :-)
>>>
>>> Kenn
>>>
>>>


>  - ImmutableList and friends and their builders are very widely used
> and IMO still add a lot for readability and preventing someone coming 
> along
> and adding mistakes to a codebase
>

 Sugar but not required. When you compare the cost of a shade versus of
 duplicating the parts we need there is no real match IMHO.


>
> So considering it all, I would keep a vendored Guava (but also move
> off where we can, and also have our own improvements). I hope it will be a
> near-empty build file to generate it so not a maintenance burden.
>
> Kenn
>
> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles 
> wrote:
>
>> Nice. This sounds like a great idea (or two?) and goes along with
>> what I just started for futures.
>>
>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>
>> Specific things for our micro guava:
>>
>>  - checkArgumentNotNull can throw the right exception
>>  - our own Optional because Java's is not Serializable
>>  - futures combinators since many are missing, especially Java's
>> don't do exceptions right
>>
>> Protobuf: didn't file an issue because I'm not sure
>>
>> I was wondering if pre-shading works. We really need to get rid of it
>> from public APIs in a 100% reliable way. It is also a problem for 
>> Dataflow.
>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to 
>> take
>> a Message object. (and do the same for beam-model-protobuf-java since the
>> model bits have to depend on each other but nothing else).
>>
>> Kenn
>>
>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía 
>> wrote:
>>
>>> Huge +1 to get rid of Guava!
>>>
>>> This solves annoying dependency issues for some IOs and allows us to
>>> get rid of the shading that makes current jars bigger than they
>>> should
>>> be.
>>>
>>> We can create our own 'micro guava' package with some classes for
>>> things that are hard to migrate, or that we  prefer to still have
>>> like
>>> the check* methods for example. Given the size of 

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Kenneth Knowles
On Fri, Feb 2, 2018 at 7:18 AM, Romain Manni-Bucau 
wrote:

> Don't forget beam doesn't support much behind it (mainly only a few
> ParameterizedType due the usage code path) so it is mainly only about
> handling parameterized types and typevariables recursively. Not a lot of
> work. Main concern being it is in the API so using a shade as an API is
> never a good idea, in particular cause the shade can be broken and requires
> to setup clirr or things like that and when it breaks you are done and need
> to fork it anyway. Limiting the dependencies for an API - as beam is - is
> always saner even if it requires a small fork of code.
>

The main thing that TypeToken is used for is capturing generics that are
lost by Java reflection. It is a bit tricky, actually.

Kenn
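The trick TypeToken relies on can be shown in a few lines: type arguments are erased from ordinary references, but an anonymous subclass of a generic type records them in its generic superclass, where reflection can read them back. A minimal sketch (TypeCapture is a hypothetical name, not the Beam or Guava class):

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Sketch of the generics-capture trick underlying TypeToken and Beam's
// TypeDescriptor: subclassing a generic type preserves its type argument
// in class metadata, surviving erasure.
public abstract class TypeCapture<T> {

  // Reads T back out of this instance's generic superclass.
  public Type capturedType() {
    ParameterizedType superType =
        (ParameterizedType) getClass().getGenericSuperclass();
    return superType.getActualTypeArguments()[0];
  }

  public static void main(String[] args) {
    // The trailing {} creates an anonymous subclass - that is what
    // records List<String> where reflection can see it.
    TypeCapture<java.util.List<String>> capture =
        new TypeCapture<java.util.List<String>>() {};
    System.out.println(capture.capturedType());
  }
}
```

This only scratches the surface: resolving type variables through inheritance chains (the "generic ancestor" problem mentioned above) is the genuinely tricky part that makes a hand-rolled replacement risky.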



>
> Romain Manni-Bucau
>
> 2018-02-02 15:49 GMT+01:00 Kenneth Knowles :
>
>> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau > > wrote:
>>
>>>
>>>
>>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles :
>>>
 Another couple:

  - User-facing TypeDescriptor is a very thin wrapper on Guava's
 TypeToken

>>>
>>> Technically reflect Type is enough
>>>
>>
>> If you want to try to remove TypeToken from underneath TypeDescriptor, I
>> have no objections as long as you expand the test suite to cover all the
>> functionality where we trust TypeToken's tests. Good luck :-)
>>
>> Kenn
>>
>>
>>>
>>>
  - ImmutableList and friends and their builders are very widely used
 and IMO still add a lot for readability and preventing someone coming along
 and adding mistakes to a codebase

>>>
>>> Sugar but not required. When you compare the cost of a shade versus of
>>> duplicating the parts we need there is no real match IMHO.
>>>
>>>

 So considering it all, I would keep a vendored Guava (but also move off
 where we can, and also have our own improvements). I hope it will be a
 near-empty build file to generate it so not a maintenance burden.

 Kenn

 On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles  wrote:

> Nice. This sounds like a great idea (or two?) and goes along with what
> I just started for futures.
>
> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
> assigned to Ismaël :-) and converted my futures thing to a subtask.
>
> Specific things for our micro guava:
>
>  - checkArgumentNotNull can throw the right exception
>  - our own Optional because Java's is not Serializable
>  - futures combinators since many are missing, especially Java's don't
> do exceptions right
>
> Protobuf: didn't file an issue because I'm not sure
>
> I was wondering if pre-shading works. We really need to get rid of it
> from public APIs in a 100% reliable way. It is also a problem for 
> Dataflow.
> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to 
> take
> a Message object. (and do the same for beam-model-protobuf-java since the
> model bits have to depend on each other but nothing else).
>
> Kenn
>
> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía 
> wrote:
>
>> Huge +1 to get rid of Guava!
>>
>> This solves annoying dependency issues for some IOs and allows us to
>> get rid of the shading that makes current jars bigger than they should
>> be.
>>
>> We can create our own 'micro guava' package with some classes for
>> things that are hard to migrate, or that we  prefer to still have like
>> the check* methods for example. Given the size of the task we should
>> probably divide it into subtasks, more important is to get rid of it
>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>
>> Other important idea would be to get rid of Protobuf in public APIs
>> like GCPIO and to better shade it from leaking into the runners. An
>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>> that is biting us for the Spark runner, but well that's worth a
>> different discussion.
>>
>>
>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>  wrote:
>> > a map of list is fine and not a challenge we'll face long I hope ;)
>> >
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>> >
>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax :
>> >>
>> >> Not sure 

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Romain Manni-Bucau
Don't forget beam doesn't use much of it (mainly only a few
ParameterizedType cases on the usage code path), so it is mainly about
handling parameterized types and type variables recursively. Not a lot of
work. The main concern is that it is in the API, and using a shade in an API
is never a good idea, in particular because the shade can break, which
requires setting up clirr or similar tooling, and when it breaks you are
stuck and need to fork it anyway. Limiting the dependencies of an API - as
beam is - is always saner, even if it requires a small fork of code.


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book


2018-02-02 15:49 GMT+01:00 Kenneth Knowles :

> On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau 
> wrote:
>
>>
>>
>> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles :
>>
>>> Another couple:
>>>
>>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
>>>
>>
>> Technically reflect Type is enough
>>
>
> If you want to try to remove TypeToken from underneath TypeDescriptor, I
> have no objections as long as you expand the test suite to cover all the
> functionality where we trust TypeToken's tests. Good luck :-)
>
> Kenn
>
>
>>
>>
>>>  - ImmutableList and friends and their builders are very widely used and
>>> IMO still add a lot for readability and preventing someone coming along and
>>> adding mistakes to a codebase
>>>
>>
>> Sugar but not required. When you compare the cost of a shade versus of
>> duplicating the parts we need there is no real match IMHO.
>>
>>
>>>
>>> So considering it all, I would keep a vendored Guava (but also move off
>>> where we can, and also have our own improvements). I hope it will be a
>>> near-empty build file to generate it so not a maintenance burden.
>>>
>>> Kenn
>>>
>>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles  wrote:
>>>
 Nice. This sounds like a great idea (or two?) and goes along with what
 I just started for futures.

 Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
 assigned to Ismaël :-) and converted my futures thing to a subtask.

 Specific things for our micro guava:

  - checkArgumentNotNull can throw the right exception
  - our own Optional because Java's is not Serializable
  - futures combinators since many are missing, especially Java's don't
 do exceptions right

 Protobuf: didn't file an issue because I'm not sure

 I was wondering if pre-shading works. We really need to get rid of it
 from public APIs in a 100% reliable way. It is also a problem for Dataflow.
 I was wondering if one approach is to pre-shade gcpio-protobuf-java,
 gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
 a Message object. (and do the same for beam-model-protobuf-java since the
 model bits have to depend on each other but nothing else).

 Kenn

 On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía  wrote:

> Huge +1 to get rid of Guava!
>
> This solves annoying dependency issues for some IOs and allows us to
> get rid of the shading that makes current jars bigger than they should
> be.
>
> We can create our own 'micro guava' package with some classes for
> things that are hard to migrate, or that we  prefer to still have like
> the check* methods for example. Given the size of the task we should
> probably divide it into subtasks, more important is to get rid of it
> for 'sdks/java/core'. We can then attack other areas afterwards.
>
> Other important idea would be to get rid of Protobuf in public APIs
> like GCPIO and to better shade it from leaking into the runners. An
> unexpected side effect of this is a leak of netty via gRPC/protobuf
> that is biting us for the Spark runner, but well that's worth a
> different discussion.
>
>
> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>  wrote:
> > a map of list is fine and not a challenge we'll face long I hope ;)
> >
> >
> > Romain Manni-Bucau
> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
> >
> > 2018-02-01 9:40 GMT+01:00 Reuven Lax :
> >>
> >> Not sure we'll be able to replace them all. Things like guava Table
> and
> >> Multimap don't have great replacements in Java8.
> >>
> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
> >> wrote:
> >>>
> >>> +1, it was on my TODO for a while waiting the Java8 update.
> >>>
> >>> Regards
> >>> JB
> >>>
> 

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Kenneth Knowles
On Fri, Feb 2, 2018 at 6:41 AM, Romain Manni-Bucau 
wrote:

>
>
> 2018-02-02 15:37 GMT+01:00 Kenneth Knowles :
>
>> Another couple:
>>
>>  - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
>>
>
> Technically reflect Type is enough
>

If you want to try to remove TypeToken from underneath TypeDescriptor, I
have no objections as long as you expand the test suite to cover all the
functionality where we trust TypeToken's tests. Good luck :-)

Kenn


>
>
>>  - ImmutableList and friends and their builders are very widely used and
>> IMO still add a lot for readability and preventing someone coming along and
>> adding mistakes to a codebase
>>
>
> Sugar but not required. When you compare the cost of a shade versus of
> duplicating the parts we need there is no real match IMHO.
>
>
>>
>> So considering it all, I would keep a vendored Guava (but also move off
>> where we can, and also have our own improvements). I hope it will be a
>> near-empty build file to generate it so not a maintenance burden.
>>
>> Kenn
>>
>> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles  wrote:
>>
>>> Nice. This sounds like a great idea (or two?) and goes along with what I
>>> just started for futures.
>>>
>>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>>
>>> Specific things for our micro guava:
>>>
>>>  - checkArgumentNotNull can throw the right exception
>>>  - our own Optional because Java's is not Serializable
>>>  - futures combinators since many are missing, especially Java's don't
>>> do exceptions right
>>>
>>> Protobuf: didn't file an issue because I'm not sure
>>>
>>> I was wondering if pre-shading works. We really need to get rid of it
>>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>>> a Message object. (and do the same for beam-model-protobuf-java since the
>>> model bits have to depend on each other but nothing else).
>>>
>>> Kenn
>>>
>>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía  wrote:
>>>
 Huge +1 to get rid of Guava!

 This solves annoying dependency issues for some IOs and allows us to
 get rid of the shading that makes current jars bigger than they should
 be.

 We can create our own 'micro guava' package with some classes for
 things that are hard to migrate, or that we prefer to keep, like
 the check* methods for example. Given the size of the task we should
 probably divide it into subtasks; the most important one is to get rid of it
 for 'sdks/java/core'. We can then attack other areas afterwards.

 Another important idea would be to get rid of Protobuf in public APIs
 like GCPIO and to better shade it from leaking into the runners. An
 unexpected side effect of this is a leak of netty via gRPC/protobuf
 that is biting us for the Spark runner, but that's worth a
 different discussion.


 On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
  wrote:
 > a map of lists is fine, and not a challenge we'll face for long I hope ;)
 >
 >
 > Romain Manni-Bucau
 > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
 >
 > 2018-02-01 9:40 GMT+01:00 Reuven Lax :
 >>
 >> Not sure we'll be able to replace them all. Things like guava Table and
 >> Multimap don't have great replacements in Java8.
 >>
 >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
 j...@nanthrax.net>
 >> wrote:
 >>>
 >>> +1, it was on my TODO for a while, waiting for the Java8 update.
 >>>
 >>> Regards
 >>> JB
 >>>
 >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
 >>> > Why not drop guava for the whole beam codebase? With java 8 it is quite
 >>> > easy to do it and avoid a bunch of conflicts. Did it in 2 projects with
 >>> > quite a good result.
 >>> >
 >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" >> > > a écrit :
 >>> >
 >>> > Make sure to include the guava version in the artifact name so that
 >>> > we can have multiple vendored versions.
 >>> >
 >>> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
 k...@google.com
 >>> > > wrote:
 >>> >
 >>> > I didn't have time for this, but it just bit me. We definitely have
 >>> > Guava on the API surface of runner support code in ways that get
 >>> > incompatibly shaded. I will probably start "1a" by making a shaded
 >>> > library org.apache.beam:vendored-guava 
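[For readers following along: a vendored artifact like the org.apache.beam:vendored-guava mentioned above would amount to little more than a package relocation at build time. A sketch of what that could look like with the Maven shade plugin — the relocated package name and version suffix here are hypothetical, not the names Beam actually settled on:]

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <!-- Hypothetical relocated package. Including the guava version,
             as Lukasz suggests, lets several vendored versions coexist. -->
        <shadedPattern>org.apache.beam.vendor.guava.v20.com.google.common</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```

[This is the "near-empty build file" Kenn hopes for: the vendored module would contain essentially just this relocation and a dependency on the pinned Guava version.]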

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-02 Thread Romain Manni-Bucau
2018-02-02 15:37 GMT+01:00 Kenneth Knowles :

> Another couple:
>
>  - User-facing TypeDescriptor is a very thin wrapper on Guava's TypeToken
>

Technically reflect Type is enough


>  - ImmutableList and friends and their builders are very widely used and
> IMO still add a lot for readability and help prevent someone coming along
> and adding mistakes to a codebase
>

Sugar, but not required. When you compare the cost of a shade versus
duplicating the parts we need, there is no real match IMHO.
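[A concrete sketch of what "duplicating the parts we need" can look like for the check* helpers mentioned elsewhere in the thread — every name here is illustrative, not Beam's actual API:]

```java
// Hypothetical "micro guava": the precondition helpers are a few lines
// each, so copying them can beat shading all of Guava.
final class Preconditions {
  private Preconditions() {}

  // Unlike Guava's checkNotNull (which throws NullPointerException),
  // this variant throws IllegalArgumentException, arguably the right
  // exception for a bad argument.
  static <T> T checkArgumentNotNull(T value, String message) {
    if (value == null) {
      throw new IllegalArgumentException(message);
    }
    return value;
  }

  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }
}
```

[The return value on checkArgumentNotNull keeps the Guava-style inline usage: `this.field = Preconditions.checkArgumentNotNull(arg, "arg");`.]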


>
> So considering it all, I would keep a vendored Guava (but also move off
> where we can, and also have our own improvements). I hope it will be a
> near-empty build file to generate it so not a maintenance burden.
>
> Kenn
>
> On Thu, Feb 1, 2018 at 8:44 PM, Kenneth Knowles  wrote:
>
>> Nice. This sounds like a great idea (or two?) and goes along with what I
>> just started for futures.
>>
>> Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and
>> assigned to Ismaël :-) and converted my futures thing to a subtask.
>>
>> Specific things for our micro guava:
>>
>>  - checkArgumentNotNull can throw the right exception
>>  - our own Optional because Java's is not Serializable
>>  - futures combinators since many are missing, especially Java's don't do
>> exceptions right
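[On the "our own Optional" item above: the motivation is that java.util.Optional is deliberately not Serializable, which bites as soon as one is captured by a function that must be serialized. A hypothetical sketch of the replacement — names and shape are illustrative, not Beam's actual class:]

```java
import java.io.Serializable;
import java.util.NoSuchElementException;

// Hypothetical Serializable alternative to java.util.Optional.
final class SerializableOptional<T extends Serializable> implements Serializable {
  private final T value; // null encodes "absent"

  private SerializableOptional(T value) {
    this.value = value;
  }

  static <T extends Serializable> SerializableOptional<T> of(T value) {
    if (value == null) {
      throw new NullPointerException("value");
    }
    return new SerializableOptional<>(value);
  }

  static <T extends Serializable> SerializableOptional<T> absent() {
    return new SerializableOptional<>(null);
  }

  boolean isPresent() {
    return value != null;
  }

  T get() {
    if (value == null) {
      throw new NoSuchElementException("absent");
    }
    return value;
  }
}
```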
>>
>> Protobuf: didn't file an issue because I'm not sure
>>
>> I was wondering if pre-shading works. We really need to get rid of it
>> from public APIs in a 100% reliable way. It is also a problem for Dataflow.
>> I was wondering if one approach is to pre-shade gcpio-protobuf-java,
>> gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
>> a Message object. (and do the same for beam-model-protobuf-java since the
>> model bits have to depend on each other but nothing else).
>>
>> Kenn
>>
>> On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía  wrote:
>>
>>> Huge +1 to get rid of Guava!
>>>
>>> This solves annoying dependency issues for some IOs and allows us to
>>> get rid of the shading that makes current jars bigger than they should
>>> be.
>>>
>>> We can create our own 'micro guava' package with some classes for
>>> things that are hard to migrate, or that we prefer to keep, like
>>> the check* methods for example. Given the size of the task we should
>>> probably divide it into subtasks; the most important one is to get rid of it
>>> for 'sdks/java/core'. We can then attack other areas afterwards.
>>>
>>> Another important idea would be to get rid of Protobuf in public APIs
>>> like GCPIO and to better shade it from leaking into the runners. An
>>> unexpected side effect of this is a leak of netty via gRPC/protobuf
>>> that is biting us for the Spark runner, but that's worth a
>>> different discussion.
>>>
>>>
>>> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>>>  wrote:
>>> > a map of lists is fine, and not a challenge we'll face for long I hope ;)
>>> >
>>> >
>>> > Romain Manni-Bucau
>>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>>> >
>>> > 2018-02-01 9:40 GMT+01:00 Reuven Lax :
>>> >>
>>> >> Not sure we'll be able to replace them all. Things like guava Table and
>>> >> Multimap don't have great replacements in Java8.
>>> >>
>>> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré <
>>> j...@nanthrax.net>
>>> >> wrote:
>>> >>>
>>> >>> +1, it was on my TODO for a while, waiting for the Java8 update.
>>> >>>
>>> >>> Regards
>>> >>> JB
>>> >>>
>>> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>> >>> > Why not drop guava for the whole beam codebase? With java 8 it is
>>> >>> > quite easy to do it and avoid a bunch of conflicts. Did it in 2
>>> >>> > projects with quite a good result.
>>> >>> >
>>> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" >> >>> > > a écrit :
>>> >>> >
>>> >>> > Make sure to include the guava version in the artifact name so
>>> >>> > that we can have multiple vendored versions.
>>> >>> >
>>> >>> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles <
>>> k...@google.com
>>> >>> > > wrote:
>>> >>> >
>>> >>> > I didn't have time for this, but it just bit me. We definitely
>>> >>> > have Guava on the API surface of runner support code in ways that
>>> >>> > get incompatibly shaded. I will probably start "1a" by making a
>>> >>> > shaded library org.apache.beam:vendored-guava and starting to use
>>> >>> > it. It sounds like there is generally unanimous support for that
>>> >>> > much, anyhow.
>>> >>> >
>>> >>> > Kenn
>>> >>> >
>>> >>> > On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>> >>> > >> >>> > > wrote:
>>> >>> >
>>> >>> > Thanks Ismaël for bringing up this discussion again!
>>> >>> >
>>> >>> > 

Build failed in Jenkins: beam_PostRelease_NightlySnapshot #14

2018-02-02 Thread Apache Jenkins Server
See 


Changes:

[ekirpichov] Introduces the Wait transform

[ehudm] Split out buffered read and write code from gcsio.

[github] Fix undefined names: exc_info --> self.exc_info

[github] import logging for line 1163

[dkulp] [BEAM-3562] Update to Checkstyle 8.7

[klk] Encourage a good description in a good spot on a PR description.

[lcwik] Change info to debug statement

[tgroh] Add QueryablePipeline

[gene] Changing FileNaming to public to allow for usage in lambdas/inheritance

--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on beam4 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 9cf86bcebcdbd8d5a84777cf2871597f0ba1b951 (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9cf86bcebcdbd8d5a84777cf2871597f0ba1b951
Commit message: "Merge pull request #4301: Introduces the Wait transform"
 > git rev-list 82e5e944ba143c23acf377bfd7a850046be68ea7 # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
[EnvInject] - Executing scripts and injecting environment variables after the 
SCM step.
[EnvInject] - Injecting as environment variables the properties content 
SPARK_LOCAL_IP=127.0.0.1

[EnvInject] - Variables injected successfully.
[beam_PostRelease_NightlySnapshot] $ /bin/bash -xe 
/tmp/jenkins6865322189646170951.sh
+ cd src/release
+ groovy quickstart-java-direct.groovy
/tmp/jenkins6865322189646170951.sh: line 2: groovy: command not found
Build step 'Execute shell' marked build as failure
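[For future readers hitting the same failure: "groovy: command not found" is an environment problem on the worker, not a code regression. A hedged sketch of a guard the shell step could run first — require_cmd is a made-up helper, not part of the actual Beam release scripts:]

```shell
#!/bin/sh
# Made-up helper: fail fast with a clear message when a tool the
# release script depends on (here: groovy) is missing from PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1 (install it on the Jenkins worker or fix PATH)" >&2
    return 127
  fi
}

require_cmd sh   # prints "found: sh" on any POSIX worker
# The nightly job would start with:  require_cmd groovy || exit 127
```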


Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-02 Thread Holden Karau
For what it's worth, there is a relatively easy Java 8-to-Scala future
conversion, so this shouldn't cause an issue on the Spark runner.
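[For readers less familiar with the proposal's background: the "async chaining" that Java 5's Future lacks, and one of the exception sharp edges Kenn alludes to below, look roughly like this:]

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class FutureChainingDemo {
  public static void main(String[] args) {
    // Async chaining, the thing Java 5's Future cannot express:
    // each stage runs when the previous one completes.
    String greeting = CompletableFuture.supplyAsync(() -> "beam")
        .thenApply(String::toUpperCase)
        .thenApply(s -> "hello, " + s)
        .join();
    System.out.println(greeting); // hello, BEAM

    // One of the sharp edges: exceptions thrown inside an async stage
    // surface wrapped in CompletionException, so combinators that
    // "do exceptions right" must unwrap them.
    CompletableFuture<String> failing = CompletableFuture.supplyAsync(
        () -> { throw new IllegalStateException("boom"); });
    String recovered = failing
        .exceptionally(t -> {
          Throwable cause = t instanceof CompletionException ? t.getCause() : t;
          return "recovered: " + cause.getMessage();
        })
        .join();
    System.out.println(recovered); // recovered: boom
  }
}
```

[The unwrap-before-use pattern in exceptionally is exactly the kind of boilerplate a small set of shared future combinators would hide.]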

On Thu, Feb 1, 2018 at 11:22 PM, Alexey Romanenko 
wrote:

> +1, sounds great!
>
> Regards,
> Alexey
>
>
> On 2 Feb 2018, at 07:14, Thomas Weise  wrote:
>
> +1
>
>
> On Thu, Feb 1, 2018 at 9:07 PM, Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> Regards
>> JB
>>
>> On 02/01/2018 07:54 PM, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > Luke, Thomas, and I had some in-person discussions about the use of
>> Java 8
>> > futures and Guava futures in the portability support code. I wanted to
>> bring our
>> > thoughts to the dev list for feedback.
>> >
>> > As background:
>> >
>> >  - Java 5+ "Future" lacks the main purpose of future, which is async
>> chaining.
>> >  - Guava introduced ListenableFuture to do real future-oriented
>> programming
>> >  - Java 8 added CompletionStage which is more-or-less the expected
>> interface
>> >
>> > It is still debatable whether Java got it right [1]. But since it is
>> > standardized, doesn't need to be shaded, etc, it is worth trying to
>> just use it
>> > carefully in the right ways. So we thought to propose that we migrate
>> most uses
>> > of Guava futures to Java 8 futures.
>> >
>> > What do you think? Have we missed an important problem that would make
>> this a
>> > deal-breaker?
>> >
>> > Kenn
>> >
>> > [1]
>> > e.g. https://stackoverflow.com/questions/38744943/listenablefuture-vs-completablefuture#comment72041244_39250452
>> > and such discussions are likely to occur whenever you bring it up with
>> someone
>> > who cares a lot about futures :-)
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>


-- 
Twitter: https://twitter.com/holdenkarau