Re: Precommit Jenkins Linkage Broken

2017-05-30 Thread Ted Yu
INFRA-14247 is currently marked Major.

Suggest raising the priority so that it gets more attention.

Cheers

On Tue, May 30, 2017 at 2:59 PM, Jason Kuster <
jasonkus...@google.com.invalid> wrote:

> Hey folks,
>
> Just wanted to mention on the dev list that Jenkins precommit breakage is a
> known issue and has been escalated to Infra (thanks JB!)[1]. I'm monitoring
> the issue and will ping back here with any updates and when it starts
> working again.
>
> Best,
>
> Jason
>
> [1] https://issues.apache.org/jira/browse/INFRA-14247
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>


Precommit Jenkins Linkage Broken

2017-05-30 Thread Jason Kuster
Hey folks,

Just wanted to mention on the dev list that Jenkins precommit breakage is a
known issue and has been escalated to Infra (thanks JB!)[1]. I'm monitoring
the issue and will ping back here with any updates and when it starts
working again.

Best,

Jason

[1] https://issues.apache.org/jira/browse/INFRA-14247

-- 
---
Jason Kuster
Apache Beam / Google Cloud Dataflow


Re: [DISCUSS] HadoopInputFormat based IOs

2017-05-30 Thread Stephen Sisk
Ah, thanks for clarifying, Ismaël.

I think you would agree that we need to have integration testing of HIFIO.
Cassandra and ES are currently the only ITs for HIFIO. If we want to write
ITs for HIFIO that don't rely on ES/Cassandra with the idea that we'd
remove ES/Cassandra, I could be okay with that. The data store in question
would need to have both small & large k8s cluster scripts so that we can do
small & large integration tests (since that's what's currently supported
with HIFIO today and I don't think we should go backwards.)

The reason I hesitate to use a data store that doesn't have a native
implementation is that we can use ES/Cassandra's native write transform to
eventually switch HIFIO ITs to the new writeThenRead style IO IT [1] that
will *drastically* simplify maintenance requirements for the HIFIO tests.
WriteThenRead writes the test data inside of the test, thus removing the
requirement for a separate data loading step outside of the test. We
*could* write inside the test setup code (thus running only on one
machine), but for larger data amounts, that takes too long - it's easier to
do the write using the IO, which runs in parallel, and thus is a lot
quicker. That means we need a data store that has a native, parallelizable
write.
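As a rough illustration of that writeThenRead shape (plain Python with an in-memory stand-in for ES/Cassandra; all names here are hypothetical, not Beam APIs):

```python
# Hypothetical sketch of a writeThenRead-style IO integration test: the test
# itself writes the expected data (in a real IT, via the IO's parallel write
# transform) and then reads it back through the read path under test.
class InMemoryStore:
    """Stand-in for a real data store such as ES or Cassandra."""

    def __init__(self):
        self.rows = []

    def write(self, records):
        # In a real IT this would be the IO's native, parallelizable write.
        self.rows.extend(records)

    def read(self):
        return list(self.rows)


def write_then_read_it(store, n=1000):
    expected = [{"id": i, "value": i * i} for i in range(n)]
    store.write(expected)           # data loading happens inside the test
    actual = store.read()           # exercise the read path under test
    assert sorted(r["id"] for r in actual) == list(range(n))
    return len(actual)
```

The point of the pattern is that no separately maintained data-loading step is needed; the write side of the IO does the loading, in parallel.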

What do you think? Basically, I agree with you in principle, but given that
using a data store without a native implementation requires either a separate
data loading step or slower tests, I'd strongly prefer to keep using
ES/Cassandra. (you could make the case that we should remove one of them.
I'm not attached to keeping both.)


> having [ES/Cassandra HIFIO read-code] in the source code base would [not]
> be consistent with the ideas of the previous paragraph.
I do agree with this. If we keep the ES/Cassandra HIFIO test code, I'd
propose that we add comments in there directing people to the correct
native source.

S
[1] writeThenRead style IO IT -
https://lists.apache.org/thread.html/26ee3ba827c2917c393ab26ce97e7491846594d8f574b5ae29a44551@%3Cdev.beam.apache.org%3E

On Tue, May 30, 2017 at 1:47 PM Ismaël Mejía  wrote:

> The whole goal of this discussion is that we define what we should do
> when someone wants to add a new IO that uses HIFIO. The consensus so
> far, following the PR comments + this thread, is that it should be
> discouraged and those contributions be included as documentation on
> the website, and that we should give priority to the native
> implementations, which seems reasonable (e.g., to encourage better
> implementations and avoid the maintenance burden).
>
> So, I was wondering what would be a good rule to justify having tests
> for some data stores as part of the HIFIO tests, and I don't see a
> strong reason to do this, in particular once those have native
> implementations. To be more clear: in the current case we have HIFIO
> tests (jdk1.8-tests) for Elasticsearch5 and Cassandra, both of which
> are not yet covered by the native IOs. However, once the native IOs
> for both systems are merged, I don't see any reason to keep the extra
> tests in HIFIO, because we would be duplicating effort to test an IO
> that is not native and does not support Write, so I think we should
> remove those. Also, not having this in the source code base would be
> consistent with the ideas of the previous paragraph.
>
> But maybe I am missing something here; do you see any strong reason
> to keep them?
>


RE: [DISCUSS] HadoopInputFormat based IOs

2017-05-30 Thread Seshadri Raghunathan
+1

I think this is a good way to streamline HIFIO and native IOs.

Regards,
Seshadri
408 601 7548

-Original Message-
From: Ismaël Mejía [mailto:ieme...@gmail.com] 
Sent: Tuesday, May 30, 2017 1:47 PM
To: dev@beam.apache.org
Subject: Re: [DISCUSS] HadoopInputFormat based IOs

The whole goal of this discussion is that we define what we should do when
someone wants to add a new IO that uses HIFIO. The consensus so far, following
the PR comments + this thread, is that it should be discouraged and those
contributions be included as documentation on the website, and that we should
give priority to the native implementations, which seems reasonable (e.g., to
encourage better implementations and avoid the maintenance burden).

So, I was wondering what would be a good rule to justify having tests for some
data stores as part of the HIFIO tests, and I don't see a strong reason to do
this, in particular once those have native implementations. To be more clear:
in the current case we have HIFIO tests (jdk1.8-tests) for Elasticsearch5 and
Cassandra, both of which are not yet covered by the native IOs. However, once
the native IOs for both systems are merged, I don't see any reason to keep the
extra tests in HIFIO, because we would be duplicating effort to test an IO
that is not native and does not support Write, so I think we should remove
those. Also, not having this in the source code base would be consistent with
the ideas of the previous paragraph.

But maybe I am missing something here; do you see any strong reason to keep
them?



Re: [DISCUSS] HadoopInputFormat based IOs

2017-05-30 Thread Ismaël Mejía
The whole goal of this discussion is that we define what we should do
when someone wants to add a new IO that uses HIFIO. The consensus so
far, following the PR comments + this thread, is that it should be
discouraged and those contributions be included as documentation on
the website, and that we should give priority to the native
implementations, which seems reasonable (e.g., to encourage better
implementations and avoid the maintenance burden).

So, I was wondering what would be a good rule to justify having tests
for some data stores as part of the HIFIO tests, and I don't see a
strong reason to do this, in particular once those have native
implementations. To be more clear: in the current case we have HIFIO
tests (jdk1.8-tests) for Elasticsearch5 and Cassandra, both of which
are not yet covered by the native IOs. However, once the native IOs
for both systems are merged, I don't see any reason to keep the extra
tests in HIFIO, because we would be duplicating effort to test an IO
that is not native and does not support Write, so I think we should
remove those. Also, not having this in the source code base would be
consistent with the ideas of the previous paragraph.

But maybe I am missing something here; do you see any strong reason
to keep them?


Re: [DISCUSS] HadoopInputFormat based IOs

2017-05-30 Thread Stephen Sisk
Great, I'm glad to hear that. I filed BEAM-2388 [1] to track the work
(currently unassigned).

> today we have Cassandra and Elasticsearch5 examples based
on HIF that will be clearly redundant once we have the native
versions, so they should maybe be moved into the proposed website
section
Can you clarify what you're proposing removing? Are you saying that we
should remove the ES/Cassandra examples from the HIFIO web page linked to
from the built-in page [2]? I definitely agree with that, thanks for
pointing that out. (I don't think you're proposing removing the tests,
e.g. HIFIOWithEmbeddedCassandraTest.)

S

[1] https://issues.apache.org/jira/browse/BEAM-2388
[2] https://beam.apache.org/documentation/io/built-in/hadoop/

On Tue, May 30, 2017 at 12:13 PM Ismaël Mejía  wrote:

> I agree 100% with Stephen's points. I think that including a
> 'discoverability' section for these IOs that are shared by multiple
> data stores is a great step, in particular for the HIF ones.
>
> I would like us to define what we would do concretely with the
> HIFIO-based implementations of IOs once their native implementation is
> merged. E.g., today we have Cassandra and Elasticsearch5 examples based
> on HIF that will be clearly redundant once we have the native
> versions, so they should maybe be moved into the proposed website
> section. What do you guys think?
>
> Any other ideas/comments on the general subject?
>
>
>
> On Tue, May 23, 2017 at 7:25 PM, Stephen Sisk 
> wrote:
> > hey,
> >
> > Thanks for bringing this up! It's definitely an interesting question and
> I
> > can see both sides of the argument.
> >
> > I can see the appeal of HIFIO wrapper IOs as stop-gaps and if they have
> > good test coverage, it does ensure that the HIFIO route is working. If we
> > have good IT coverage, it also means there's fewer steps involved in
> > building a native IO as well, since the ITs will already be written.
> >
> > However, I think I'm still assuming that the community will implement
> > native IOs for most data stores that users want to interact with, and
> thus
> > I'd still discourage building IOs that are just HIFIO/jdbc wrappers. I'd
> > personally rather devote time and resources to native IOs. If we don't
> see
> > traction on building more IOs then I'd be more open to it.
> >
> > If we do choose to go down this "Don't build HIFIO wrappers, just improve
> > discoverability" route, one idea I had floating around in my head was
> that
> > we might add a section to the Built-in IO Transforms page that covers
> > "non-native but readable" IOs (better name suggestions appreciated :) -
> > that could include a list of data stores that jdbc/jms/hifio support and
> > link to HIFIO's info on how to use them. (That might also be a good place
> > to document the performance tradeoffs of using HIFIO)
> >
> > S
> >
> >
> > On Tue, May 23, 2017 at 9:53 AM Ismaël Mejía  wrote:
> >
> >> Hello, I bring this subject to the mailing list to see everybody’s
> >> opinion on the subject.
> >>
> >> The recent inclusion of HadoopInputFormatIO (HiFiIO) gave Beam users
> >> the option to ‘easily’ include data stores that support the
> >> Hadoop-based partitioning scheme. There are currently examples of how
> >> to use it for example to read from Elasticsearch and Cassandra. In
> >> both cases we already have specific IOs on master or as WIP so using
> >> HiFiIO based IO is not needed.
> >>
> >> During the review of the recent IO for Hive (HCatalog) that uses
> >> HiFiIO instead of a native API, there was a discussion about the fact
> >> that this shouldn’t be included as a specific IO but better to add the
> >> tests/documentation of how to read Hive records using the existing
> >> HiFiIO. This makes sense from an abstraction point of view, however
> >> there are visibility issues since end users would need to repackage
> >> and discover the supported (and tested) HiFi-based IOs that won’t be
> >> explicit in the code base.
> >>
> >> I would like to know what other members of the community think about
> >> this, is it worth to have individual IOs based on HiFiIO for things
> >> that we currently don’t support (e.g. Hive or Amazon Redshift) (option
> >> 1) or maybe it is just better to add just the tests/docs of how to use
> >> them as proposed in the PR (option 2).
> >>
> >> Feel free to comment/vote or maybe add an eventual third option if you
> >> think there is one better option.
> >>
> >> Regards,
> >> Ismaël Mejía
> >>
> >> [1] https://issues.apache.org/jira/browse/BEAM-1158
> >>
>


Re: [INFO] Build fails on GCP IO (Spanner)

2017-05-30 Thread Jean-Baptiste Onofré

Got it.

Thanks for the details Luke !

Regards
JB

On 05/30/2017 09:30 PM, Lukasz Cwik wrote:

JB, the issue is that we have been careful so far to not require a GCP
project or credentials as part of a test until SpannerIO broke this.
BEAM-2131 is about having a stronger precommit if Jenkins ran in an
environment which better modeled a user's machine (e.g. no GCP
project/credentials can be inferred automatically).

On Tue, May 30, 2017 at 12:00 PM, Jean-Baptiste Onofré 
wrote:


Yeah, however I didn't have any issue with the other GCP IOs. Only
SpannerIO has this issue and "blocks" the build locally.

(it's a simple mvn clean install on my machine)

Thanks !
Regards
JB


On 05/30/2017 08:40 PM, Lukasz Cwik wrote:


This is a known issue (https://issues.apache.org/jira/browse/BEAM-2131)
where our Jenkins runs use a GCP VM which allows for credentials
and project to be inferred automatically.


On Mon, May 29, 2017 at 9:48 AM, Jean-Baptiste Onofré 
wrote:

Yup, it seems so.


I created:

https://issues.apache.org/jira/browse/BEAM-2379

for tracking, and I'm going to take a look while waiting for Mairbek's feedback.

Thanks !
Regards
JB


On 05/29/2017 06:43 PM, Dan Halperin wrote:

This looks like somewhere the unit tests are inferring a project from the
environment when they should not be doing so.

On Mon, May 29, 2017 at 8:38 AM, Jean-Baptiste Onofré 
wrote:

Gonna try to purge my local m2 repo.



Regards
JB


On 05/29/2017 08:05 AM, Jean-Baptiste Onofré wrote:

Hi team,



Since last week, the build is broken due to test failures on the
GCP/Spanner IO:

java.lang.IllegalArgumentException: A project ID is required for this
service but could not be determined from the builder or the
environment.
Please set a project ID using the builder.
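For illustration, the lookup order this error implies (an explicit builder value first, then environment inference) can be sketched in plain Python; the variable names below are assumptions for the sketch, not the actual GCP client internals:

```python
import os

def resolve_project_id(explicit=None, env=None):
    """Hypothetical sketch: use an explicitly supplied project ID, fall back
    to environment inference, otherwise fail like the reported error."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    for var in ("GOOGLE_CLOUD_PROJECT", "GCLOUD_PROJECT"):  # illustrative names
        if env.get(var):
            return env[var]
    raise ValueError(
        "A project ID is required for this service but could not be "
        "determined from the builder or the environment.")
```

This models why Jenkins passes (it runs where inference succeeds) while a plain `mvn clean install` on a developer machine hits the final branch.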

However, Jenkins seems OK on this. I checked and I don't see anything
special in the system variables or JVM arguments.

I started a change on the SpannerIO to get the project ID in the code in
order to have the tests OK (fixing SpannerIO write). Depending on the
answers on this e-mail, I will create a pull request.

Do you think it's reasonable? I don't see anything special in the
README about new prerequisites for SpannerIO.

Does anyone else notice this test failure?

Thanks,
Regards
JB


--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [INFO] Build fails on GCP IO (Spanner)

2017-05-30 Thread Lukasz Cwik
JB, the issue is that we have been careful so far to not require a GCP
project or credentials as part of a test until SpannerIO broke this.
BEAM-2131 is about having a stronger precommit if Jenkins ran in an
environment which better modeled a user's machine (e.g. no GCP
project/credentials can be inferred automatically).
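That "better modeled environment" can be sketched roughly as an env-scrubbing step before running the tests; the variable names are illustrative assumptions, not an actual Beam or Jenkins configuration:

```python
def scrub_gcp_env(env):
    """Return a copy of env that models a user machine with no inferable
    GCP project or credentials (variable names are illustrative)."""
    inferable = {
        "GOOGLE_APPLICATION_CREDENTIALS",
        "GOOGLE_CLOUD_PROJECT",
        "GCLOUD_PROJECT",
    }
    return {k: v for k, v in env.items() if k not in inferable}
```

Running the precommit tests under such a scrubbed environment would have surfaced the SpannerIO failure before it reached developer machines.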

On Tue, May 30, 2017 at 12:00 PM, Jean-Baptiste Onofré 
wrote:

> Yeah, however I didn't have any issue with the other GCP IOs. Only
> SpannerIO has this issue and "blocks" the build locally.
>
> (it's a simple mvn clean install on my machine)
>
> Thanks !
> Regards
> JB
>
>
> On 05/30/2017 08:40 PM, Lukasz Cwik wrote:
>
>> This is a known issue (https://issues.apache.org/jira/browse/BEAM-2131)
>> where our Jenkins runs use a GCP VM which allows for credentials
>> and project to be inferred automatically.
>>
>>
>> On Mon, May 29, 2017 at 9:48 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Yup, it seems so.
>>>
>>> I created:
>>>
>>> https://issues.apache.org/jira/browse/BEAM-2379
>>>
>>> for tracking, and I'm going to take a look while waiting for Mairbek's feedback.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>>
>>> On 05/29/2017 06:43 PM, Dan Halperin wrote:
>>>
> >>> This looks like somewhere the unit tests are inferring a project from the
> >>> environment when they should not be doing so.

 On Mon, May 29, 2017 at 8:38 AM, Jean-Baptiste Onofré 
 wrote:

 Gonna try to purge my local m2 repo.

>
> Regards
> JB
>
>
> On 05/29/2017 08:05 AM, Jean-Baptiste Onofré wrote:
>
> Hi team,
>
>>
>> Since last week, the build is broken due to test failures on the
>> GCP/Spanner IO:
>>
>> java.lang.IllegalArgumentException: A project ID is required for this
>> service but could not be determined from the builder or the
>> environment.
>> Please set a project ID using the builder.
>>
>> However, Jenkins seems OK on this. I checked and I don't see anything
>> special in the system variables or JVM arguments.
>>
>> I started a change on the SpannerIO to get the project ID in the code in
>> order to have the tests OK (fixing SpannerIO write). Depending on the
>> answers on this e-mail, I will create a pull request.
>>
>> Do you think it's reasonable? I don't see anything special in the
>> README about new prerequisites for SpannerIO.
>>
>> Does anyone else notice this test failure?
>>
>> Thanks,
>> Regards
>> JB
>>
>>
>> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
 --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] HadoopInputFormat based IOs

2017-05-30 Thread Ismaël Mejía
I agree 100% with Stephen's points. I think that including a
'discoverability' section for these IOs that are shared by multiple
data stores is a great step, in particular for the HIF ones.

I would like us to define what we would do concretely with the
HIFIO-based implementations of IOs once their native implementation is
merged. E.g., today we have Cassandra and Elasticsearch5 examples based
on HIF that will be clearly redundant once we have the native
versions, so they should maybe be moved into the proposed website
section. What do you guys think?

Any other ideas/comments on the general subject?



On Tue, May 23, 2017 at 7:25 PM, Stephen Sisk  wrote:
> hey,
>
> Thanks for bringing this up! It's definitely an interesting question and I
> can see both sides of the argument.
>
> I can see the appeal of HIFIO wrapper IOs as stop-gaps and if they have
> good test coverage, it does ensure that the HIFIO route is working. If we
> have good IT coverage, it also means there's fewer steps involved in
> building a native IO as well, since the ITs will already be written.
>
> However, I think I'm still assuming that the community will implement
> native IOs for most data stores that users want to interact with, and thus
> I'd still discourage building IOs that are just HIFIO/jdbc wrappers. I'd
> personally rather devote time and resources to native IOs. If we don't see
> traction on building more IOs then I'd be more open to it.
>
> If we do choose to go down this "Don't build HIFIO wrappers, just improve
> discoverability" route, one idea I had floating around in my head was that
> we might add a section to the Built-in IO Transforms page that covers
> "non-native but readable" IOs (better name suggestions appreciated :) -
> that could include a list of data stores that jdbc/jms/hifio support and
> link to HIFIO's info on how to use them. (That might also be a good place
> to document the performance tradeoffs of using HIFIO)
>
> S
>
>
> On Tue, May 23, 2017 at 9:53 AM Ismaël Mejía  wrote:
>
>> Hello, I bring this subject to the mailing list to see everybody’s
>> opinion on the subject.
>>
>> The recent inclusion of HadoopInputFormatIO (HiFiIO) gave Beam users
>> the option to ‘easily’ include data stores that support the
>> Hadoop-based partitioning scheme. There are currently examples of how
>> to use it for example to read from Elasticsearch and Cassandra. In
>> both cases we already have specific IOs on master or as WIP so using
>> HiFiIO based IO is not needed.
>>
>> During the review of the recent IO for Hive (HCatalog) that uses
>> HiFiIO instead of a native API, there was a discussion about the fact
>> that this shouldn’t be included as a specific IO but better to add the
>> tests/documentation of how to read Hive records using the existing
>> HiFiIO. This makes sense from an abstraction point of view, however
>> there are visibility issues since end users would need to repackage
>> and discover the supported (and tested) HiFi-based IOs that won’t be
>> explicit in the code base.
>>
>> I would like to know what other members of the community think about
>> this, is it worth to have individual IOs based on HiFiIO for things
>> that we currently don’t support (e.g. Hive or Amazon Redshift) (option
>> 1) or maybe it is just better to add just the tests/docs of how to use
>> them as proposed in the PR (option 2).
>>
>> Feel free to comment/vote or maybe add an eventual third option if you
>> think there is one better option.
>>
>> Regards,
>> Ismaël Mejía
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-1158
>>


Re: [INFO] Build fails on GCP IO (Spanner)

2017-05-30 Thread Jean-Baptiste Onofré
Yeah, however I didn't have any issue with the other GCP IOs. Only SpannerIO has 
this issue and "blocks" the build locally.


(it's a simple mvn clean install on my machine)

Thanks !
Regards
JB

On 05/30/2017 08:40 PM, Lukasz Cwik wrote:

This is a known issue (https://issues.apache.org/jira/browse/BEAM-2131)
where our Jenkins runs use a GCP VM which allows for credentials
and project to be inferred automatically.


On Mon, May 29, 2017 at 9:48 AM, Jean-Baptiste Onofré 
wrote:


Yup, it seems so.

I created:

https://issues.apache.org/jira/browse/BEAM-2379

for tracking, and I'm going to take a look while waiting for Mairbek's feedback.

Thanks !
Regards
JB


On 05/29/2017 06:43 PM, Dan Halperin wrote:


This looks like somewhere the unit tests are inferring a project from the
environment when they should not be doing so.

On Mon, May 29, 2017 at 8:38 AM, Jean-Baptiste Onofré 
wrote:

Gonna try to purge my local m2 repo.


Regards
JB


On 05/29/2017 08:05 AM, Jean-Baptiste Onofré wrote:

Hi team,


Since last week, the build is broken due to test failures on the
GCP/Spanner IO:

java.lang.IllegalArgumentException: A project ID is required for this
service but could not be determined from the builder or the environment.
Please set a project ID using the builder.

However, Jenkins seems OK on this. I checked and I don't see anything
special in the system variables or JVM arguments.

I started a change on the SpannerIO to get the project ID in the code in
order to have the tests OK (fixing SpannerIO write). Depending on the
answers on this e-mail, I will create a pull request.

Do you think it's reasonable? I don't see anything special in the
README about new prerequisites for SpannerIO.

Does anyone else notice this test failure?

Thanks,
Regards
JB



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [INFO] Build fails on GCP IO (Spanner)

2017-05-30 Thread Lukasz Cwik
This is a known issue (https://issues.apache.org/jira/browse/BEAM-2131)
where our Jenkins runs use a GCP VM which allows for credentials
and project to be inferred automatically.


On Mon, May 29, 2017 at 9:48 AM, Jean-Baptiste Onofré 
wrote:

> Yup, it seems so.
>
> I created:
>
> https://issues.apache.org/jira/browse/BEAM-2379
>
> for tracking, and I'm going to take a look while waiting for Mairbek's feedback.
>
> Thanks !
> Regards
> JB
>
>
> On 05/29/2017 06:43 PM, Dan Halperin wrote:
>
>> This looks like somewhere the unit tests are inferring a project from the
>> environment when they should not be doing so.
>>
>> On Mon, May 29, 2017 at 8:38 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Gonna try to purge my local m2 repo.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 05/29/2017 08:05 AM, Jean-Baptiste Onofré wrote:
>>>
>>> Hi team,

 Since last week, the build is broken due to test failures on the
 GCP/Spanner IO:

 java.lang.IllegalArgumentException: A project ID is required for this
 service but could not be determined from the builder or the environment.
 Please set a project ID using the builder.

 However, Jenkins seems OK on this. I checked and I don't see anything
 special in the system variables or JVM arguments.

 I started a change on the SpannerIO to get the project ID in the code in
 order to have the tests OK (fixing SpannerIO write). Depending on the
 answers on this e-mail, I will create a pull request.

 Do you think it's reasonable? I don't see anything special in the
 README about new prerequisites for SpannerIO.

 Does anyone else notice this test failure?

 Thanks,
 Regards
 JB


>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Python SDK: BigTableIO

2017-05-30 Thread Stephen Sisk
Hey Matthias,

to add on to what Chamikara mentioned, we have lots of info in the generic
IO authoring guide [1], the Python IO authoring guide [2] and the
PTransform Style Guide[3].  The PTransform style guide doesn't sound like
it applies, but it has a lot of specific tips from lessons we've learned in
the past from I/O work.

If you plan on contributing it back to the community, I'd also suggest
opening up a JIRA issue & updating the Beam website (e.g. [4]) to note that
you're working on this (those steps are pretty trivial).

We've recently been trying out using branches when we add new I/Os since
the PRs tend to get bigger than we like for a single PR.

Please feel free to email the dev mailing list if you have questions! We
are excited and happy to help out with thinking about design/etc. (e.g., as
Cham hinted at, should you use a Source vs. regular ParDo transforms?)

S

[1] https://beam.apache.org/documentation/io/authoring-overview/
[2] https://beam.apache.org/documentation/sdks/python-custom-io/
[3] https://beam.apache.org/contribute/ptransform-style-guide/
[4] https://github.com/apache/beam-site/pull/250
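As a rough, Beam-free illustration of the bounded-source contract Chamikara describes in the quoted reply below (estimate size, split into bundles, read each bundle), here is a plain-Python sketch; the method names are illustrative, not the actual apache_beam.io.iobase signatures:

```python
# Hypothetical sketch of the bounded-source read/split contract: a source
# that can estimate its size, split itself into bundles for parallel
# reading, and read each bundle independently.
class CountingSource:
    def __init__(self, start, stop):
        self.start, self.stop = start, stop

    def estimate_size(self):
        # Rough size estimate used to decide how finely to split.
        return self.stop - self.start

    def split(self, desired_bundle_size):
        # Static (initial) splitting into roughly equal bundles.
        pos = self.start
        while pos < self.stop:
            end = min(pos + desired_bundle_size, self.stop)
            yield CountingSource(pos, end)
            pos = end

    def read(self):
        # Each bundle reads only its own range; a runner would do this
        # on separate workers.
        return list(range(self.start, self.stop))
```

In the real Python SDK such a source would be wrapped in a PTransform so users never touch the source directly.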

On Sun, May 28, 2017 at 5:32 PM Chamikara Jayalath 
wrote:

> Thanks for offering to help. I would suggest looking into the existing Java
> BigTableIO connector and the currently available Python client library for
> Cloud BigTable to see how feasible it is to develop an efficient BigTable
> connector at this point. From the Python SDK's perspective you can use the
> iobase.BoundedSource API (wrapped by a PTransform) to develop a read
> PTransform with support for dynamic/static splitting. Sinks are usually
> developed as PTransforms (the iobase.Sink interface is deprecated so I
> suggest not using that). I would be happy to review any PRs related to this.
>
> Thanks,
> Cham
>
> On Sun, May 28, 2017 at 2:30 AM Matthias Baetens <
> matthias.baet...@datatonic.com> wrote:
>
> > Hey guys,
> >
> > We have been using Beam for quite a few months now, so we (my colleague
> > Robert & I) thought it might be cool to contribute a bit as well.
> >
> > The challenge we want to take up is writing the BigTableIO for the Python
> > SDK (which is not yet in the works according to the website
> > <
> >
> https://github.com/apache/beam-site/blob/asf-site/src/documentation/io/built-in.md
> > >.
> > I have searched JIRA for the BigTableIO issue and did not find it, so I
> > suppose this is the first step we take.
> >
> > Any pointers or feedback more than welcome!
> >
> > Best,
> >
> > Matthias
> >
>


Re: low availability in the coming 4 weeks

2017-05-30 Thread Aviem Zur
Congratulations!

On Fri, May 26, 2017 at 9:21 AM Kenneth Knowles 
wrote:

> Congrats!
>
> On Thu, May 25, 2017 at 2:00 PM, Raghu Angadi 
> wrote:
>
> > Congrats Mingmin. All the best!
> >
> > On Wed, May 24, 2017 at 8:33 PM, Mingmin Xu  wrote:
> >
> > > Hello everyone,
> > >
> > > I'll take 4 weeks off to take care of my new born baby. I'm very glad
> > that
> > > James Xu agrees to take my role in Beam SQL feature.
> > >
> > > PS: I'll consolidate the PR for BEAM-2010 before then.
> > >
> > > Thank you!
> > > 
> > > Mingmin
> > >
> >
>