Re: pubsub -> IO

2019-07-21 Thread Chaim Turkel
I have been working on this and have a question.
The flow is: message -> use the message to fetch data -> continue the flow.

I need some of the information from the original message in the later
stages of my flow. Is there a simple way to do this using side inputs
and windows? Or do I need to add KV support when fetching the data, so
that the original object is passed along with it?
Any ideas? Thanks,

chaim
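
[A minimal sketch of the KV idea raised above, assuming the Pub/Sub message carries
the collection name as an attribute; the attribute name and the fetch helper are
illustrative only, not an existing API:]

import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.bson.Document;

// Emit the bits of the original message that are still needed downstream
// together with every record fetched for it, so later stages see both.
PCollection<KV<String, Document>> withContext =
    messages.apply("FetchWithContext", ParDo.of(
        new DoFn<PubsubMessage, KV<String, Document>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            PubsubMessage msg = c.element();
            String collection = msg.getAttribute("collection");     // assumed attribute
            for (Document doc : fetchFromMongo(collection, msg)) {  // hypothetical helper
              c.output(KV.of(collection, doc));
            }
          }
        }));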

On Thu, Jul 18, 2019 at 12:45 PM Ismaël Mejía  wrote:
>
> Just discovered that RedisIO exposes ReadAll too so you can take a
> look at that one too.
>
> On Thu, Jul 18, 2019 at 11:39 AM Ismaël Mejía  wrote:
> >
> > Yes, this is done in multiple IOs now; you can see how it is done in
> > JdbcIO, or in a simpler form (without an explicit ReadAll transform) in SolrIO.
> > Notice that this change requires a refactor of the IO to avoid code
> > repetition.
> > I filed https://issues.apache.org/jira/browse/BEAM-7769 and assigned
> > it to you; feel free to unassign it if you don't plan to work on it.
> > In the meantime I am going to try to expose ReadAll on SolrIO so we
> > can have it as a reference.
> >
> > On Thu, Jul 18, 2019 at 11:08 AM Chaim Turkel  wrote:
> > >
> > > is there another source that does this so i can have a look and add it
> > > to the MongoDBIO?
> > >
> > > On Wed, Jul 17, 2019 at 9:48 PM Eugene Kirpichov  
> > > wrote:
> > > >
> > > > I think full-blown SDF is not needed for this - someone just needs to 
> > > > implement a MongoDbIO.readAll() variant, using a composite transform. 
> > > > The regular pattern for this sort of thing will do (ParDo split, 
> > > > reshuffle, ParDo read).
> > > > Whether it's worth replacing MongoDbIO.read() with a redirect to 
> > > > readAll() is another matter - size estimation, available only in 
> > > > BoundedSource for now, may or may not be important.
> > > >
> > > > On Wed, Jul 17, 2019 at 2:39 AM Ryan Skraba  wrote:
> > > >>
> > > >> Hello!  To clarify, you want to do something like this?
> > > >>
> > > >> PubSubIO.read() -> extract mongodb collection and range -> 
> > > >> MongoDbIO.read(collection, range) -> ...
> > > >>
> > > >> If I'm not mistaken, it isn't possible with the implementation of 
> > > >> MongoDbIO (based on BoundedSource interface, requiring the collection 
> > > >> to be specified once at pipeline construction time).
> > > >>
> > > >> BUT -- this is a good candidate for an improvement in composability, 
> > > >> and the ongoing work to prefer the SDF for these types of use cases.   
> > > >> Maybe raise a JIRA for an improvement?
> > > >>
> > > >> All my best, Ryan
> > > >>
> > > >>
> > > >> On Wed, Jul 17, 2019 at 9:35 AM Chaim Turkel  wrote:
> > > >>>
> > > >>> any ideas?
> > > >>>
> > > >>> On Mon, Jul 15, 2019 at 11:04 PM Rui Wang  wrote:
> > > >>> >
> > > >>> > +u...@beam.apache.org
> > > >>> >
> > > >>> >
> > > >>> > -Rui
> > > >>> >
> > > >>> > On Mon, Jul 15, 2019 at 6:55 AM Chaim Turkel  
> > > >>> > wrote:
> > > >>> >>
> > > >>> >> Hi,
> > > >>> >>   I am looking to write a pipeline that read from a mongo 
> > > >>> >> collection.
> > > >>> >>   I would like to listen to a pubsub that will have a object that 
> > > >>> >> will
> > > >>> >> tell me which collection and which time frame.
> > > >>> >>   Is there a way to do this?
> > > >>> >>
> > > >>> >> Chaim
> > > >>> >>
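
[A rough sketch of the composite pattern Eugene describes above (ParDo split ->
Reshuffle -> ParDo read). ReadRequest, readRequests, and the two helper methods are
hypothetical placeholders, not the actual MongoDbIO API:]

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.bson.Document;

// readRequests: PCollection<ReadRequest>, e.g. (collection, filter) parsed from Pub/Sub.
PCollection<Document> docs =
    readRequests
        .apply("SplitIntoRanges", ParDo.of(new DoFn<ReadRequest, ReadRequest>() {
          @ProcessElement
          public void split(ProcessContext c) {
            // Break one request into smaller id/time ranges so reads can parallelize.
            for (ReadRequest shard : splitIntoShards(c.element())) {  // hypothetical helper
              c.output(shard);
            }
          }
        }))
        .apply("Redistribute", Reshuffle.viaRandomKey())  // break fusion, spread shards
        .apply("ReadRange", ParDo.of(new DoFn<ReadRequest, Document>() {
          @ProcessElement
          public void read(ProcessContext c) {
            for (Document doc : queryMongo(c.element())) {  // hypothetical Mongo query
              c.output(doc);
            }
          }
        }));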

Re: pubsub -> IO

2019-07-18 Thread Chaim Turkel
Is there another source that does this, so I can have a look and add it
to MongoDbIO?

On Wed, Jul 17, 2019 at 9:48 PM Eugene Kirpichov  wrote:
>
> I think full-blown SDF is not needed for this - someone just needs to 
> implement a MongoDbIO.readAll() variant, using a composite transform. The 
> regular pattern for this sort of thing will do (ParDo split, reshuffle, ParDo 
> read).
> Whether it's worth replacing MongoDbIO.read() with a redirect to readAll() is 
> another matter - size estimation, available only in BoundedSource for now, 
> may or may not be important.
>
> On Wed, Jul 17, 2019 at 2:39 AM Ryan Skraba  wrote:
>>
>> Hello!  To clarify, you want to do something like this?
>>
>> PubSubIO.read() -> extract mongodb collection and range -> 
>> MongoDbIO.read(collection, range) -> ...
>>
>> If I'm not mistaken, it isn't possible with the implementation of MongoDbIO 
>> (based on BoundedSource interface, requiring the collection to be specified 
>> once at pipeline construction time).
>>
>> BUT -- this is a good candidate for an improvement in composability, and the 
>> ongoing work to prefer the SDF for these types of use cases.   Maybe raise a 
>> JIRA for an improvement?
>>
>> All my best, Ryan
>>
>>
>> On Wed, Jul 17, 2019 at 9:35 AM Chaim Turkel  wrote:
>>>
>>> any ideas?
>>>
>>> On Mon, Jul 15, 2019 at 11:04 PM Rui Wang  wrote:
>>> >
>>> > +u...@beam.apache.org
>>> >
>>> >
>>> > -Rui
>>> >
>>> > On Mon, Jul 15, 2019 at 6:55 AM Chaim Turkel  wrote:
>>> >>
>>> >> Hi,
>>> >>   I am looking to write a pipeline that read from a mongo collection.
>>> >>   I would like to listen to a pubsub that will have a object that will
>>> >> tell me which collection and which time frame.
>>> >>   Is there a way to do this?
>>> >>
>>> >> Chaim
>>> >>


Re: pubsub -> IO

2019-07-17 Thread Chaim Turkel
Any ideas?

On Mon, Jul 15, 2019 at 11:04 PM Rui Wang  wrote:
>
> +u...@beam.apache.org
>
>
> -Rui
>
> On Mon, Jul 15, 2019 at 6:55 AM Chaim Turkel  wrote:
>>
>> Hi,
>>   I am looking to write a pipeline that read from a mongo collection.
>>   I would like to listen to a pubsub that will have a object that will
>> tell me which collection and which time frame.
>>   Is there a way to do this?
>>
>> Chaim
>>


pubsub -> IO

2019-07-15 Thread Chaim Turkel
Hi,
  I am looking to write a pipeline that reads from a Mongo collection.
  I would like to listen to a Pub/Sub topic whose messages carry an object
that tells me which collection and which time frame to read.
  Is there a way to do this?

Chaim



Re: pipeline timeout

2019-07-09 Thread Chaim Turkel
Sorry for not being explicit: my pipeline is in Java, and I launch it
from Python in Airflow.
I would like Airflow to cancel the pipeline if it has been running for
more than x minutes.
Currently I do this with the CLI, but it is not optimal.

chaim
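
[For reference, a Java-side equivalent of the pattern Mark shows below; the
40-minute limit and the pipeline variable are illustrative:]

import java.io.IOException;
import org.apache.beam.sdk.PipelineResult;
import org.joda.time.Duration;

// pipeline is the already-constructed org.apache.beam.sdk.Pipeline.
PipelineResult result = pipeline.run();
// waitUntilFinish(Duration) returns the terminal state, or null if the timeout elapsed.
PipelineResult.State state = result.waitUntilFinish(Duration.standardMinutes(40));
if (state == null || !state.isTerminal()) {
  try {
    result.cancel();
  } catch (IOException e) {
    // Cancellation is best-effort; log and fall back to the CLI or console if needed.
  }
}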

On Mon, Jul 8, 2019 at 7:24 PM Mark Liu  wrote:
>
> Hi Chaim,
>
> You can checkout PipelineResult class and do something like:
>
> result = p.run()
> result.wait_until_finish(duration=TIMEOUT_SEC)
> if not PipelineState.is_terminal(result.state):
>   result.cancel()
>
> The implementation of PipelineResult depends on what runner you choose. And 
> you may find more useful functions in its subclass.
>
> Mark
>
>
> On Sun, Jul 7, 2019 at 12:59 AM Chaim Turkel  wrote:
>>
>> Hi,
>>   I have a pipeline that usually takes 15-30 minutes. Sometimes things
>> get stuck (from 3rd party side). I would like to know if there is a
>> way to cancel the job if it is running for more than x minutes? I know
>> there is a cli command but i would like it either on the pipeline
>> config or in the python sdk.
>> Any ideas?
>>
>> Chaim Turkel
>>


pipeline timeout

2019-07-07 Thread Chaim Turkel
Hi,
  I have a pipeline that usually takes 15-30 minutes. Sometimes it gets
stuck (on the third-party side). Is there a way to cancel the job if it
has been running for more than x minutes? I know there is a CLI command,
but I would like to do this either in the pipeline configuration or in
the Python SDK.
Any ideas?

Chaim Turkel



Re: jobs not started

2019-06-27 Thread Chaim Turkel
Seems like a Google Cloud issue:
https://status.cloud.google.com/

chaim

On Thu, Jun 27, 2019 at 10:23 AM Tim Robertson
 wrote:
>
> Hi Chaim,
>
> To help you we'd need a little more detail I think - what environment, 
> runner, how you launch your jobs etc.
>
> My first impression is that is sounds more like an environment related thing 
> rather than a Beam codebase issue. If it is a DataFlow environment I expect 
> you might need to explore the helpdesk of dataflow. I notice for example 
> others report this on SO
> https://stackoverflow.com/questions/30189691/dataflow-zombie-jobs-stuck-in-not-started-state
>
> I hope this is somewhat useful,
> Tim
>
> On Thu, Jun 27, 2019 at 8:12 AM Chaim Turkel  wrote:
>>
>> since the night all my jobs that i run are stuck in not started, and ideas 
>> why?
>> chaim
>>


jobs not started

2019-06-27 Thread Chaim Turkel
Since last night all the jobs I run have been stuck in the "not started" state. Any ideas why?
chaim



Re: pipeline status tracing

2019-06-23 Thread Chaim Turkel
I am writing to BigQuery.

On Fri, Jun 21, 2019 at 7:12 PM Alexey Romanenko
 wrote:
>
> I see that similar questions happen quite often in the last time, so, 
> probably, it would make sense to add this “Wait.on()”-pattern to 
> corresponding website documentation page [1].
>
> [1] https://beam.apache.org/documentation/patterns/file-processing-patterns/
>
> On 20 Jun 2019, at 23:00, Eugene Kirpichov  wrote:
>
> If you're writing to files, you can already do this: FileIO.write() returns 
> WriteFilesResult that you can use in combination with Wait.on() and 
> JdbcIO.write() to write something to a database afterwards.
> Something like:
>
> PCollection<..> writeResult = data.apply(FileIO.write()...);
> Create.of("dummy").apply(Wait.on(writeResult)).apply(JdbcIO.write(...write to 
> database...))
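
[Spelled out a bit more, the Wait.on() pattern above might look like the following
sketch; the bucket, JDBC settings, and status table are assumptions:]

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Wait;

// records: PCollection<String> with the exported data; pipeline is the Pipeline object.
WriteFilesResult<Void> written =
    records.apply("WriteExport", FileIO.<String>write()
        .via(TextIO.sink())
        .to("gs://my-bucket/exports/"));      // assumed output location

pipeline
    .apply("StatusSeed", Create.of("export done"))
    .apply(Wait.on(written.getPerDestinationOutputFilenames()))  // fires only after files exist
    .apply("WriteStatus", JdbcIO.<String>write()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://host/db"))  // assumed database
        .withStatement("insert into job_status(message) values (?)")
        .withPreparedStatementSetter((element, stmt) -> stmt.setString(1, element)));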
>
>
>
> On Wed, Jun 19, 2019 at 11:49 PM Chaim Turkel  wrote:
>>
>> is it something you can do, or point me in the right direction?
>> chaim
>>
>> On Wed, Jun 19, 2019 at 8:36 PM Kenneth Knowles  wrote:
>> >
>> > This sounds like a fairly simply and useful change to the IO to output 
>> > information about where it has written. It seems generally useful to 
>> > always output something like this, instead of PDone.
>> >
>> > Kenn
>> >
>> > On Wed, Jun 19, 2019 at 5:42 AM Chaim Turkel  wrote:
>> >>
>> >> Hi,
>> >>   I would like to write a status at the end of my pipline. I had
>> >> written about this in the past, and wanted to know if there are any
>> >> new ideas.
>> >> The problem is the end of the pipeline return PDone, and i can't do
>> >> anything with this.
>> >> So my senario is after i export data from mongo to google storage i
>> >> want to write to a db that the job was done with some extra
>> >> information.
>> >>
>> >> Chaim Turkel
>> >>


Re: pipeline status tracing

2019-06-20 Thread Chaim Turkel
Is this something you could do, or can you point me in the right direction?
chaim

On Wed, Jun 19, 2019 at 8:36 PM Kenneth Knowles  wrote:
>
> This sounds like a fairly simply and useful change to the IO to output 
> information about where it has written. It seems generally useful to always 
> output something like this, instead of PDone.
>
> Kenn
>
> On Wed, Jun 19, 2019 at 5:42 AM Chaim Turkel  wrote:
>>
>> Hi,
>>   I would like to write a status at the end of my pipline. I had
>> written about this in the past, and wanted to know if there are any
>> new ideas.
>> The problem is the end of the pipeline return PDone, and i can't do
>> anything with this.
>> So my senario is after i export data from mongo to google storage i
>> want to write to a db that the job was done with some extra
>> information.
>>
>> Chaim Turkel
>>


pipeline status tracing

2019-06-19 Thread Chaim Turkel
Hi,
  I would like to write a status record at the end of my pipeline. I have
written about this in the past and wanted to know whether there are any
new ideas.
The problem is that the end of the pipeline returns PDone, and I can't do
anything with it.
My scenario: after I export data from Mongo to Google Cloud Storage, I
want to write to a database that the job is done, along with some extra
information.

Chaim Turkel



Re: pardo - multiple instances

2019-06-02 Thread Chaim Turkel
Thanks, I have a side effect and that was the issue.

On Sun, Jun 2, 2019 at 12:54 PM Reuven Lax  wrote:
>
> In general a ParDo might be execute more than once. Most runners will only 
> allow one of those executions to have any effect on the pipeline (e.g. if you 
> are counting elements, this will make sure that you end up with the correct 
> count), however if your ParDo has side effects outside of your pipeline then 
> you need to make sure that those side effects are idempotent.
>
> On Sat, Jun 1, 2019 at 11:16 PM Chaim Turkel  wrote:
>>
>> Hi,
>>   I have a case where at the end of my pipline i have one element
>> output and in the last method i write the status and some clean up. I
>> see that sometimes this code is run more than once (is this possible,
>> and how can i prevent it?)
>>
>>
>> Chaim
>>


pardo - multiple instances

2019-06-02 Thread Chaim Turkel
Hi,
  I have a case where, at the end of my pipeline, I have a single output
element, and in the last step I write the status and do some cleanup. I
see that sometimes this code runs more than once. Is that possible, and
how can I prevent it?


Chaim



Re: cancel job

2019-05-02 Thread Chaim Turkel
Thanks for the reply.
  I am using Airflow Python code to run a Java pipeline, so I do not
have the actual pipeline handle. Is there a way to get it?
chaim

On Thu, May 2, 2019 at 7:58 PM Lukasz Cwik  wrote:
>
> +u...@beam.apache.org
>
> On Thu, May 2, 2019 at 9:51 AM Lukasz Cwik  wrote:
>>
>> ... build pipeline ...
>> pipeline_result = p.run()
>> if job_taking_too_long:
>>   pipeline_result.cancel()
>>
>> Python: 
>> https://github.com/apache/beam/blob/95d0ac5e5cb59fd0c6a8a4861a38a7087a6c46b5/sdks/python/apache_beam/runners/runner.py#L372
>> Java: 
>> https://github.com/apache/beam/blob/ce77db10cdd5f021f383721a90f30205aff0fabe/sdks/java/core/src/main/java/org/apache/beam/sdk/PipelineResult.java#L46
>>
>> On Wed, May 1, 2019 at 11:48 PM Chaim Turkel  wrote:
>>>
>>> Hi,
>>>I have a batch job that should run for about 40 minutes. There are
>>> times that it can run for hours, and i don't know why.
>>> I need the option to cancel the job if it runs for more than x
>>> minutes. I can do this from the gui or the gcloud cli.
>>>   Is there an api code that i can do this preferable in python, java is 
>>> also ok
>>>
>>>
>>> chaim
>>>


cancel job

2019-05-02 Thread Chaim Turkel
Hi,
   I have a batch job that should run for about 40 minutes. There are
times when it runs for hours, and I don't know why.
I need the option to cancel the job if it runs for more than x
minutes. I can do this from the GUI or the gcloud CLI.
  Is there an API I can use to do this, preferably in Python? Java is also OK.


chaim



pipeline steps

2019-02-07 Thread Chaim Turkel
Hi,
  I am working on a pipeline that listens to a Pub/Sub topic to learn about
files that have changed in storage. I then read the Avro files and would
like to write them to BigQuery based on the file name (to different
tables).
  My problem is that the transform that reads the Avro does not give me
back the file name (as a tuple or something similar). This pattern keeps
coming up for me.
Can you think of any solutions?

Chaim
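
[One possible sketch for keeping the file name next to each record: match the
incoming paths with FileIO and read each file with the Avro DataFileStream directly,
emitting KV<fileName, record>. Coders and error handling are left out, and this is
only one way to approach it:]

import java.io.IOException;
import java.nio.channels.Channels;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// filePaths: PCollection<String> of GCS paths extracted from the Pub/Sub notifications.
PCollection<KV<String, GenericRecord>> recordsWithName =
    filePaths
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply("ReadAvroKeepName", ParDo.of(
            new DoFn<FileIO.ReadableFile, KV<String, GenericRecord>>() {
              @ProcessElement
              public void process(ProcessContext c) throws IOException {
                FileIO.ReadableFile file = c.element();
                String name = file.getMetadata().resourceId().toString();
                try (DataFileStream<GenericRecord> reader = new DataFileStream<>(
                    Channels.newInputStream(file.open()),
                    new GenericDatumReader<GenericRecord>())) {
                  while (reader.hasNext()) {
                    c.output(KV.of(name, reader.next()));
                  }
                }
              }
            }));

Downstream, the key can then drive which BigQuery table each record goes to.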



Re: contributor in the Beam

2018-12-12 Thread Chaim Turkel
Hi,
  I have another pull request for MongoDbIO:
https://github.com/apache/beam/pull/7256
thanks
chaim
On Mon, Dec 3, 2018 at 11:36 AM Jean-Baptiste Onofré  wrote:
>
> Can you please fix the conflict in the PR ?
>
> Thanks
> Regards
> JB
>
> On 03/12/2018 08:52, Chaim Turkel wrote:
> > it looks like there was a failure that is not due to the code, how can
> > i continue the process?
> > https://github.com/apache/beam/pull/7162
> >
> > On Thu, Nov 29, 2018 at 9:15 PM Chaim Turkel  wrote:
> >>
> >> hi,
> >>   i added another pr for the case of a self signed certificate ssl on
> >> the mongodb server
> >>
> >> https://github.com/apache/beam/pull/7162
> >> On Wed, Nov 28, 2018 at 5:16 PM Jean-Baptiste Onofré  
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I already upgraded locally. Let me push the PR.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 28/11/2018 16:02, Chaim Turkel wrote:
> >>>> is there any reason that the mongo client version is still on 3.2.2?
> >>>> can you upgrade it to 3.9.0?
> >>>> chaim
> >>>> On Tue, Nov 27, 2018 at 4:48 PM Jean-Baptiste Onofré  
> >>>> wrote:
> >>>>>
> >>>>> Hi Chaim,
> >>>>>
> >>>>> The best is to create a Jira describing the new features you want to
> >>>>> add. Then, you can create a PR related to this Jira.
> >>>>>
> >>>>> As I'm the original MongoDbIO author, I would be more than happy to help
> >>>>> you and review the PR.
> >>>>>
> >>>>> Thanks !
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 27/11/2018 15:37, Chaim Turkel wrote:
> >>>>>> Hi,
> >>>>>>   I have added a few features to the MongoDbIO and would like to add
> >>>>>> them to the project.
> >>>>>> I have read https://beam.apache.org/contribute/
> >>>>>> I have added a jira user, what do i need to do next?
> >>>>>>
> >>>>>> chaim
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> jbono...@apache.org
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: contributor in the Beam

2018-12-05 Thread Chaim Turkel
Can you please merge https://github.com/apache/beam/pull/7162?
On Mon, Dec 3, 2018 at 11:36 AM Jean-Baptiste Onofré  wrote:
>
> Can you please fix the conflict in the PR ?
>
> Thanks
> Regards
> JB
>
> On 03/12/2018 08:52, Chaim Turkel wrote:
> > it looks like there was a failure that is not due to the code, how can
> > i continue the process?
> > https://github.com/apache/beam/pull/7162
> >
> > On Thu, Nov 29, 2018 at 9:15 PM Chaim Turkel  wrote:
> >>
> >> hi,
> >>   i added another pr for the case of a self signed certificate ssl on
> >> the mongodb server
> >>
> >> https://github.com/apache/beam/pull/7162
> >> On Wed, Nov 28, 2018 at 5:16 PM Jean-Baptiste Onofré  
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I already upgraded locally. Let me push the PR.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 28/11/2018 16:02, Chaim Turkel wrote:
> >>>> is there any reason that the mongo client version is still on 3.2.2?
> >>>> can you upgrade it to 3.9.0?
> >>>> chaim
> >>>> On Tue, Nov 27, 2018 at 4:48 PM Jean-Baptiste Onofré  
> >>>> wrote:
> >>>>>
> >>>>> Hi Chaim,
> >>>>>
> >>>>> The best is to create a Jira describing the new features you want to
> >>>>> add. Then, you can create a PR related to this Jira.
> >>>>>
> >>>>> As I'm the original MongoDbIO author, I would be more than happy to help
> >>>>> you and review the PR.
> >>>>>
> >>>>> Thanks !
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 27/11/2018 15:37, Chaim Turkel wrote:
> >>>>>> Hi,
> >>>>>>   I have added a few features to the MongoDbIO and would like to add
> >>>>>> them to the project.
> >>>>>> I have read https://beam.apache.org/contribute/
> >>>>>> I have added a jira user, what do i need to do next?
> >>>>>>
> >>>>>> chaim
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> jbono...@apache.org
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: 2019 Beam Events

2018-12-04 Thread Chaim Turkel
It would be nice to have one in Israel.
chaim
On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas  wrote:
>
> Hi Beam Community,
>
> I started curating industry conferences, meetups and events that are relevant 
> for Beam, this initial list I came up with. I'd love your help adding others 
> that I might have overlooked. Once we're satisfied with the list, let's 
> re-share so we can coordinate proposal submissions, attendance and community 
> meetups there.
>
>
> Cheers,
>
> G
>
>
>



Re: contributor in the Beam

2018-12-04 Thread Chaim Turkel
Done, thanks.
On Mon, Dec 3, 2018 at 11:36 AM Jean-Baptiste Onofré  wrote:
>
> Can you please fix the conflict in the PR ?
>
> Thanks
> Regards
> JB
>
> On 03/12/2018 08:52, Chaim Turkel wrote:
> > it looks like there was a failure that is not due to the code, how can
> > i continue the process?
> > https://github.com/apache/beam/pull/7162
> >
> > On Thu, Nov 29, 2018 at 9:15 PM Chaim Turkel  wrote:
> >>
> >> hi,
> >>   i added another pr for the case of a self signed certificate ssl on
> >> the mongodb server
> >>
> >> https://github.com/apache/beam/pull/7162
> >> On Wed, Nov 28, 2018 at 5:16 PM Jean-Baptiste Onofré  
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I already upgraded locally. Let me push the PR.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 28/11/2018 16:02, Chaim Turkel wrote:
> >>>> is there any reason that the mongo client version is still on 3.2.2?
> >>>> can you upgrade it to 3.9.0?
> >>>> chaim
> >>>> On Tue, Nov 27, 2018 at 4:48 PM Jean-Baptiste Onofré  
> >>>> wrote:
> >>>>>
> >>>>> Hi Chaim,
> >>>>>
> >>>>> The best is to create a Jira describing the new features you want to
> >>>>> add. Then, you can create a PR related to this Jira.
> >>>>>
> >>>>> As I'm the original MongoDbIO author, I would be more than happy to help
> >>>>> you and review the PR.
> >>>>>
> >>>>> Thanks !
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 27/11/2018 15:37, Chaim Turkel wrote:
> >>>>>> Hi,
> >>>>>>   I have added a few features to the MongoDbIO and would like to add
> >>>>>> them to the project.
> >>>>>> I have read https://beam.apache.org/contribute/
> >>>>>> I have added a jira user, what do i need to do next?
> >>>>>>
> >>>>>> chaim
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> jbono...@apache.org
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: contributor in the Beam

2018-12-02 Thread Chaim Turkel
It looks like there was a failure that is not due to the code. How can
I continue the process?
https://github.com/apache/beam/pull/7162

On Thu, Nov 29, 2018 at 9:15 PM Chaim Turkel  wrote:
>
> hi,
>   i added another pr for the case of a self signed certificate ssl on
> the mongodb server
>
> https://github.com/apache/beam/pull/7162
> On Wed, Nov 28, 2018 at 5:16 PM Jean-Baptiste Onofré  
> wrote:
> >
> > Hi,
> >
> > I already upgraded locally. Let me push the PR.
> >
> > Regards
> > JB
> >
> > On 28/11/2018 16:02, Chaim Turkel wrote:
> > > is there any reason that the mongo client version is still on 3.2.2?
> > > can you upgrade it to 3.9.0?
> > > chaim
> > > On Tue, Nov 27, 2018 at 4:48 PM Jean-Baptiste Onofré  
> > > wrote:
> > >>
> > >> Hi Chaim,
> > >>
> > >> The best is to create a Jira describing the new features you want to
> > >> add. Then, you can create a PR related to this Jira.
> > >>
> > >> As I'm the original MongoDbIO author, I would be more than happy to help
> > >> you and review the PR.
> > >>
> > >> Thanks !
> > >> Regards
> > >> JB
> > >>
> > >> On 27/11/2018 15:37, Chaim Turkel wrote:
> > >>> Hi,
> > >>>   I have added a few features to the MongoDbIO and would like to add
> > >>> them to the project.
> > >>> I have read https://beam.apache.org/contribute/
> > >>> I have added a jira user, what do i need to do next?
> > >>>
> > >>> chaim
> > >>>
> > >>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com



Re: contributor in the Beam

2018-11-29 Thread Chaim Turkel
Hi,
  I added another PR for the case of a self-signed SSL certificate on
the MongoDB server:

https://github.com/apache/beam/pull/7162
On Wed, Nov 28, 2018 at 5:16 PM Jean-Baptiste Onofré  wrote:
>
> Hi,
>
> I already upgraded locally. Let me push the PR.
>
> Regards
> JB
>
> On 28/11/2018 16:02, Chaim Turkel wrote:
> > is there any reason that the mongo client version is still on 3.2.2?
> > can you upgrade it to 3.9.0?
> > chaim
> > On Tue, Nov 27, 2018 at 4:48 PM Jean-Baptiste Onofré  
> > wrote:
> >>
> >> Hi Chaim,
> >>
> >> The best is to create a Jira describing the new features you want to
> >> add. Then, you can create a PR related to this Jira.
> >>
> >> As I'm the original MongoDbIO author, I would be more than happy to help
> >> you and review the PR.
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On 27/11/2018 15:37, Chaim Turkel wrote:
> >>> Hi,
> >>>   I have added a few features to the MongoDbIO and would like to add
> >>> them to the project.
> >>> I have read https://beam.apache.org/contribute/
> >>> I have added a jira user, what do i need to do next?
> >>>
> >>> chaim
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: contributor in the Beam

2018-11-28 Thread Chaim Turkel
Is there any reason the Mongo client version is still 3.2.2?
Can you upgrade it to 3.9.0?
chaim
On Tue, Nov 27, 2018 at 4:48 PM Jean-Baptiste Onofré  wrote:
>
> Hi Chaim,
>
> The best is to create a Jira describing the new features you want to
> add. Then, you can create a PR related to this Jira.
>
> As I'm the original MongoDbIO author, I would be more than happy to help
> you and review the PR.
>
> Thanks !
> Regards
> JB
>
> On 27/11/2018 15:37, Chaim Turkel wrote:
> > Hi,
> >   I have added a few features to the MongoDbIO and would like to add
> > them to the project.
> > I have read https://beam.apache.org/contribute/
> > I have added a jira user, what do i need to do next?
> >
> > chaim
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: contributor in the Beam

2018-11-28 Thread Chaim Turkel
I have created the pull request:
https://github.com/apache/beam/pull/7148
On Tue, Nov 27, 2018 at 4:48 PM Jean-Baptiste Onofré  wrote:
>
> Hi Chaim,
>
> The best is to create a Jira describing the new features you want to
> add. Then, you can create a PR related to this Jira.
>
> As I'm the original MongoDbIO author, I would be more than happy to help
> you and review the PR.
>
> Thanks !
> Regards
> JB
>
> On 27/11/2018 15:37, Chaim Turkel wrote:
> > Hi,
> >   I have added a few features to the MongoDbIO and would like to add
> > them to the project.
> > I have read https://beam.apache.org/contribute/
> > I have added a jira user, what do i need to do next?
> >
> > chaim
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



contributor in the Beam

2018-11-27 Thread Chaim Turkel
Hi,
  I have added a few features to MongoDbIO and would like to contribute
them to the project.
I have read https://beam.apache.org/contribute/ and
I have created a Jira user. What do I need to do next?

chaim



MongoDbIO

2018-11-27 Thread Chaim Turkel
Hi,
  I would like to write a sync job to validate that I have all the records
from Mongo in my BigQuery.
To do this I would like to bring only the fields id and time from Mongo to
BigQuery, and read the full document only for the missing documents.
I did not see a way to fetch a partial (projected) document. Is there one?

chaim



MongoIO - streaming

2018-11-06 Thread Chaim Turkel
Hi,
  I am looking for a way to move my batch process that copies data from
Mongo to BigQuery over to streaming.
  Does MongoDbIO support streaming changes from Mongo? If yes, how do you
configure the query so it continuously checks for changes?

chaim



bigquery issue

2018-10-31 Thread Chaim Turkel
Hi,
  I have an issue with the BigQuery SDK code. Where is the correct
group to send it to?
chaim



Re: 2 tier input

2018-10-29 Thread Chaim Turkel
Both solutions mean that I cannot use the Beam IO classes, which would give
me the distribution, but instead have to fetch the data myself in a
ParDo. Is this something that will change in the future? I understand
that Spark has a push-down mechanism that passes the filter down to the
next level of queries.
chaim
On Mon, Oct 22, 2018 at 4:02 PM Jeff Klukas  wrote:
>
> Chaim - If the full list of IDs is able to fit comfortably in memory and the 
> Mongo collection is small enough that you can read the whole collection, you 
> may want to fetch the IDs into a Java collection using the BigQuery API 
> directly, then turn them into a Beam PCollection using 
> Create.of(collection_of_ids). You could then use MongoDbIO.read() to read the 
> entire collection, but throw out rows based on the side input of IDs.
>
> If the list of IDs is particularly small, you could fetch the collection into 
> memory and parse that into a string filter that you pass to MongoDbIO.read() 
> to specify which documents to fetch, avoiding the need for a side input.
>
> Otherwise, if it's a large number of IDs, you may need to use Beam's 
> BigQueryIO to create a PCollection for the IDs, and then pass that into a 
> ParDo with a custom DoFn that issues Mongo queries for a batch of IDs. I'm 
> not very familiar with Mongo APIs, but you'd need to give the DoFn a 
> connection to Mongo that's serializable. You could likely look at the 
> implementation of MongoDbIO for inspiration there.
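
[For the first option Jeff mentions, a minimal sketch; the connection settings,
field names, and the idList variable are assumptions:]

import java.util.List;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.bson.Document;

// idList: a List<String> fetched up front through the BigQuery client API.
PCollectionView<List<String>> idsView =
    pipeline.apply("IdsFromBigQuery", Create.of(idList)).apply(View.asList());

PCollection<Document> wanted =
    pipeline
        .apply("ReadCollection", MongoDbIO.read()
            .withUri("mongodb://host:27017")   // assumed connection settings
            .withDatabase("mydb")
            .withCollection("mycollection"))
        .apply("KeepRequestedIds", ParDo.of(new DoFn<Document, Document>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // Keep only documents whose _id came back from the BigQuery query.
            if (c.sideInput(idsView).contains(c.element().get("_id").toString())) {
              c.output(c.element());
            }
          }
        }).withSideInputs(idsView));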
>
> On Sun, Oct 21, 2018 at 5:18 AM Chaim Turkel  wrote:
>>
>> hi,
>>   I have the following flow i need to implement.
>> From the bigquery i run a query and get a list of id's then i need to
>> load from mongo all the documents based on these id's and export them
>> as an xml file.
>> How do you suggest i go about doing this?
>>
>> chaim
>>


2 tier input

2018-10-21 Thread Chaim Turkel
Hi,
  I have the following flow I need to implement.
From BigQuery I run a query and get a list of IDs; then I need to
load from Mongo all the documents matching these IDs and export them
as an XML file.
How do you suggest I go about doing this?

chaim



Re: PBegin

2018-10-17 Thread Chaim Turkel
We need to add the option to configure the MongoClientOptions from
outside. Thanks,
chaim
On Mon, Oct 15, 2018 at 7:35 PM Chamikara Jayalath  wrote:
>
> MongoDBIO is based on BoundedSource framework so there's no easy way to 
> introduce custom code (a ParDo) that precede it in a pipeline. A ReadAll 
> transform (as JB mentioned) will be ParDo based and you will be able to have 
> a preceding custom ParDo that runs the initialization and feeds data into the 
> source. So I agree that this will be the proper solution.
>
> Downside is that some advanced features (for example, dynamic work 
> rebalancing) will not be supported till Splittable DoFn is fully fleshed out. 
> But looks like MongoDB currently does not support this feature anyways so it 
> should be OK.
>
> Thanks,
> Cham
>
> On Mon, Oct 15, 2018 at 7:08 AM Jean-Baptiste Onofré  
> wrote:
>>
>> JdbcIO uses the following:
>>
>>   return input
>>   .apply(Create.of((Void) null))
>>   .apply(
>>   JdbcIO.readAll()
>>   .withDataSourceConfiguration(getDataSourceConfiguration())
>>   .withQuery(getQuery())
>>   .withCoder(getCoder())
>>   .withRowMapper(getRowMapper())
>>   .withFetchSize(getFetchSize())
>>   .withParameterSetter(
>>   (element, preparedStatement) -> {
>> if (getStatementPreparator() != null) {
>>
>> getStatementPreparator().setParameters(preparedStatement);
>> }
>>   }));
>>
>> You can see that PBegin triggers readAll() that actually fires the
>> configuration and fetching.
>>
>> I think we can do the same in MongoDbIO.
>>
>> Regards
>> JB
>>
>> On 15/10/2018 16:00, Chaim Turkel wrote:
>> > what would be the implementation for the JdbcIO?
>> > On Mon, Oct 15, 2018 at 2:47 PM Jean-Baptiste Onofré  
>> > wrote:
>> >>
>> >> If you want to reuse MongoDbIO, there's no easy way.
>> >>
>> >> However, I can introduce the same change we did in Jdbc or Elasticsearch
>> >> IOs.
>> >>
>> >> Agree ?
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On 15/10/2018 13:46, Chaim Turkel wrote:
>> >>> Thanks,
>> >>>   I need to wrap MongoDbIO.read, and don't see an easy way to do it
>> >>> chaim
>> >>> On Mon, Oct 15, 2018 at 2:30 PM Jean-Baptiste Onofré  
>> >>> wrote:
>> >>>>
>> >>>> Hi Chaim,
>> >>>>
>> >>>> you can take a look on JdbcIO.
>> >>>>
>> >>>> You can create any "startup" PCollection on the PBegin, and then you can
>> >>>> can the DoFn based on that.
>> >>>>
>> >>>> Regards
>> >>>> JB
>> >>>>
>> >>>> On 15/10/2018 13:00, Chaim Turkel wrote:
>> >>>>> Hi,
>> >>>>>   I there a way to write code before the PBegin.
>> >>>>> I am writeing a pipeline to connect to mongo with self signed ssl. I
>> >>>>> need to init the ssl connection of the java before the mongo code. So
>> >>>>> i need to write code before the PBegin but for it to run on each node?
>> >>>>>
>> >>>>>
>> >>>>> Chaim
>> >>>>>
>> >>>>
>> >>>> --
>> >>>> Jean-Baptiste Onofré
>> >>>> jbono...@apache.org
>> >>>> http://blog.nanthrax.net
>> >>>> Talend - http://www.talend.com
>> >>>
>> >>
>> >> --
>> >> Jean-Baptiste Onofré
>> >> jbono...@apache.org
>> >> http://blog.nanthrax.net
>> >> Talend - http://www.talend.com
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com



Re: PBegin

2018-10-15 Thread Chaim Turkel
What would the implementation look like for JdbcIO?
On Mon, Oct 15, 2018 at 2:47 PM Jean-Baptiste Onofré  wrote:
>
> If you want to reuse MongoDbIO, there's no easy way.
>
> However, I can introduce the same change we did in Jdbc or Elasticsearch
> IOs.
>
> Agree ?
>
> Regards
> JB
>
> On 15/10/2018 13:46, Chaim Turkel wrote:
> > Thanks,
> >   I need to wrap MongoDbIO.read, and don't see an easy way to do it
> > chaim
> > On Mon, Oct 15, 2018 at 2:30 PM Jean-Baptiste Onofré  
> > wrote:
> >>
> >> Hi Chaim,
> >>
> >> you can take a look on JdbcIO.
> >>
> >> You can create any "startup" PCollection on the PBegin, and then you can
> >> can the DoFn based on that.
> >>
> >> Regards
> >> JB
> >>
> >> On 15/10/2018 13:00, Chaim Turkel wrote:
> >>> Hi,
> >>>   I there a way to write code before the PBegin.
> >>> I am writeing a pipeline to connect to mongo with self signed ssl. I
> >>> need to init the ssl connection of the java before the mongo code. So
> >>> i need to write code before the PBegin but for it to run on each node?
> >>>
> >>>
> >>> Chaim
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: PBegin

2018-10-15 Thread Chaim Turkel
Thanks,
  I need to wrap MongoDbIO.read, and I don't see an easy way to do it.
chaim
On Mon, Oct 15, 2018 at 2:30 PM Jean-Baptiste Onofré  wrote:
>
> Hi Chaim,
>
> you can take a look on JdbcIO.
>
> You can create any "startup" PCollection on the PBegin, and then you can
> can the DoFn based on that.
>
> Regards
> JB
>
> On 15/10/2018 13:00, Chaim Turkel wrote:
> > Hi,
> >   I there a way to write code before the PBegin.
> > I am writeing a pipeline to connect to mongo with self signed ssl. I
> > need to init the ssl connection of the java before the mongo code. So
> > i need to write code before the PBegin but for it to run on each node?
> >
> >
> > Chaim
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



PBegin

2018-10-15 Thread Chaim Turkel
Hi,
  Is there a way to write code before the PBegin?
I am writing a pipeline to connect to Mongo with a self-signed SSL certificate. I
need to initialize the Java SSL context before the Mongo code runs, so
I need to write code before the PBegin, but have it run on each node.


Chaim



mongo - speed

2018-08-15 Thread Chaim Turkel
Hi,
  I am having issues reading from MongoDB; the speed is very slow.
What can I do?
My job is:
https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-08-15_07_11_49-10653597849494077702?project=ordinal-ember-163410=782381653268

chaim



mongoio

2018-07-19 Thread Chaim Turkel
Hi,
  I have been seeing very long load times for data read from MongoDB.
Is there a way to diagnose this or fine-tune it?

chaim
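
One knob worth checking, assuming the MongoDbIO API of this era: withNumSplits() hints how many splits the source should create, which directly controls how many workers read in parallel (by default the connector estimates this itself and can end up with very few readers). A sketch with placeholder connection details:

PCollection<Document> docs = pipeline.apply(
    MongoDbIO.read()
        .withUri("mongodb://host:27017")       // placeholder connection string
        .withDatabase("my_db")                 // placeholder database/collection names
        .withCollection("my_collection")
        .withNumSplits(64));                   // more splits -> more parallel readers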



Re: 2.5.0

2018-07-12 Thread Chaim Turkel
Not sure what the problem was, but after deleting Beam from my .m2 and
recompiling, it worked.
thanks
chaim
On Thu, Jul 12, 2018 at 3:18 PM Alexey Romanenko
 wrote:
>
> Hi Chaim,
>
> Let me ask you some questions:
> - From which version you are trying to upgrade?
> - Did you properly set your environment (JDK, maven)?
> - Could you send an example of compile issues that you have?
> - Which command do you use to build and to run Quickstart example? I checked 
> it with Direct runner on my side - it works fine.
>
> Alexey
>
> > On 12 Jul 2018, at 10:00, Chaim Turkel  wrote:
> >
> > Hi,
> >  I have been trying to upgrade to 2.5.0 but I am having a lot of
> > compile issues.
> > So i used the quick start, and have the same issues:
> >
> > https://beam.apache.org/get-started/quickstart-java/
> >
> > Any ideas?
> >
> > chaim
> >
>



Re: 2.5.0

2018-07-12 Thread Chaim Turkel
Hi,
  I am trying to upgrade from version 2.4.0

I also ran the command:

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.beam \
  -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
  -DarchetypeVersion=2.5.0 \
  -DgroupId=org.example \
  -DartifactId=word-count-beam \
  -Dversion="0.1" \
  -Dpackage=org.apache.beam.examples \
  -DinteractiveMode=false

and then ran mvn clean package.
It fails (using Maven 3.5.0, Java 8).

example of failure:
word-count-beam/src/main/java/org/apache/beam/examples/complete/game/UserScore.java:[25,34]
package org.apache.beam.sdk.coders does not exist
word-count-beam/src/main/java/org/apache/beam/examples/complete/game/utils/WriteWindowedToBigQuery.java:[25,38]
package org.apache.beam.sdk.transforms does not exist
word-count-beam/src/main/java/org/apache/beam/examples/complete/game/utils/WriteToText.java:[29,44]
package org.apache.beam.sdk.io.FileBasedSink does not exist
On Thu, Jul 12, 2018 at 3:18 PM Alexey Romanenko
 wrote:
>
> Hi Chaim,
>
> Let me ask you some questions:
> - From which version you are trying to upgrade?
> - Did you properly set your environment (JDK, maven)?
> - Could you send an example of compile issues that you have?
> - Which command do you use to build and to run Quickstart example? I checked 
> it with Direct runner on my side - it works fine.
>
> Alexey
>
> > On 12 Jul 2018, at 10:00, Chaim Turkel  wrote:
> >
> > Hi,
> >  I have been trying to upgrade to 2.5.0 but I am having a lot of
> > compile issues.
> > So i used the quick start, and have the same issues:
> >
> > https://beam.apache.org/get-started/quickstart-java/
> >
> > Any ideas?
> >
> > chaim
> >
>



2.5.0

2018-07-12 Thread Chaim Turkel
Hi,
  I have been trying to upgrade to 2.5.0 but I am having a lot of
compile issues.
So I used the quick start and have the same issues:

https://beam.apache.org/get-started/quickstart-java/

Any ideas?

chaim



Re: BiqQueryIO.write and Wait.on

2018-07-05 Thread Chaim Turkel
Yes, I must say I have been waiting for this for over 6 months; it
would help a lot.
chaim
On Tue, Jul 3, 2018 at 5:14 PM Eugene Kirpichov  wrote:
>
> Awesome!! Thanks for the heads up, very exciting, this is going to make a lot 
> of people happy :)
>
> On Tue, Jul 3, 2018, 3:40 AM Carlos Alonso  wrote:
>>
>> + dev@beam.apache.org
>>
>> Just a quick email to let you know that I'm starting developing this.
>>
>> On Fri, Apr 20, 2018 at 10:30 PM Eugene Kirpichov  
>> wrote:
>>>
>>> Hi Carlos,
>>>
>>> Thank you for expressing interest in taking this on! Let me give you a few 
>>> pointers to start, and I'll be happy to help everywhere along the way.
>>>
>>> Basically we want BigQueryIO.write() to return something (e.g. a 
>>> PCollection) that can be used as input to Wait.on().
>>> Currently it returns a WriteResult, which only contains a 
>>> PCollection of failed inserts - that one can not be used 
>>> directly, instead we should add another component to WriteResult that 
>>> represents the result of successfully writing some data.
>>>
>>> Given that BQIO supports dynamic destination writes, I think it makes sense 
>>> for that to be a PCollection> so that in theory we 
>>> could sequence different destinations independently (currently Wait.on() 
>>> does not provide such a feature, but it could); and it will require 
>>> changing WriteResult to be WriteResult. As for what the "???" 
>>> might be - it is something that represents the result of successfully 
>>> writing a window of data. I think it can even be Void, or "?" (wildcard 
>>> type) for now, until we figure out something better.
>>>
>>> Implementing this would require roughly the following work:
>>> - Add this PCollection> to WriteResult
>>> - Modify the BatchLoads transform to provide it on both codepaths: 
>>> expandTriggered() and expandUntriggered()
>>> ...- expandTriggered() itself writes via 2 codepaths: single-partition and 
>>> multi-partition. Both need to be handled - we need to get a 
>>> PCollection> from each of them, and Flatten these two 
>>> PCollections together to get the final result. The single-partition 
>>> codepath (writeSinglePartition) under the hood already uses WriteTables 
>>> that returns a KV so it's directly usable. The 
>>> multi-partition codepath ends in WriteRenameTriggered - unfortunately, this 
>>> codepath drops DestinationT along the way and will need to be refactored a 
>>> bit to keep it until the end.
>>> ...- expandUntriggered() should be treated the same way.
>>> - Modify the StreamingWriteTables transform to provide it
>>> ...- Here also, the challenge is to propagate the DestinationT type all the 
>>> way until the end of StreamingWriteTables - it will need to be refactored. 
>>> After such a refactoring, returning a KV should be easy.
>>>
>>> Another challenge with all of this is backwards compatibility in terms of 
>>> API and pipeline update.
>>> Pipeline update is much less of a concern for the BatchLoads codepath, 
>>> because it's typically used in batch-mode pipelines that don't get updated. 
>>> I would recommend to start with this, perhaps even with only the 
>>> untriggered codepath (it is much more commonly used) - that will pave the 
>>> way for future work.
>>>
>>> Hope this helps, please ask more if something is unclear!
>>>
>>> On Fri, Apr 20, 2018 at 12:48 AM Carlos Alonso  wrote:

 Hey Eugene!!

 I’d gladly take a stab on it although I’m not sure how much available time 
 I might have to put into but... yeah, let’s try it.

 Where should I begin? Is there a Jira issue or shall I file one?

 Thanks!
 On Thu, 12 Apr 2018 at 00:41, Eugene Kirpichov  
 wrote:
>
> Hi,
>
> Yes, you're both right - BigQueryIO.write() is currently not implemented 
> in a way that it can be used with Wait.on(). It would certainly be a 
> welcome contribution to change this - many people expressed interest in 
> specifically waiting for BigQuery writes. Is any of you interested in 
> helping out?
>
> Thanks.
>
> On Fri, Apr 6, 2018 at 12:36 AM Carlos Alonso  
> wrote:
>>
>> Hi Simon, I think your explanation was very accurate, at least to my 
>> understanding. I'd also be interested in getting batch load result's 
>> feedback on the pipeline... hopefully someone may suggest something, 
>> otherwise we could propose submitting a Jira, or even better, a PR!! :)
>>
>> Thanks!
>>
>> On Thu, Apr 5, 2018 at 2:01 PM Simon Kitching 
>>  wrote:
>>>
>>> Hi All,
>>>
>>> I need to write some data to BigQuery (batch-mode) and then send a 
>>> Pubsub message to trigger further processing.
>>>
>>> I found this thread titled "Callbacks/other functions run after a 
>>> PDone/output transform" on the user-list which was very relevant:
>>>   
>>> 
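
For reference, the end state this thread is working toward looks roughly like the sketch below: BigQueryIO.write() exposing a per-destination "done" signal that Wait.on() can consume. Wait.on() is the real transform; the getSuccessfulWrites() accessor, the table spec, and the 'messages' PCollection of PubsubMessage are hypothetical placeholders that do not exist in the API at the time of this thread:

WriteResult result = rows.apply("WriteToBQ", BigQueryIO.writeTableRows()
    .to("my_project:my_dataset.my_table"));              // placeholder table spec

// Hypothetical: one element per destination/window once its load job succeeds.
PCollection<KV<TableDestination, Void>> done = result.getSuccessfulWrites();

// Anything gated on the BigQuery write finishing, e.g. a follow-up Pubsub message:
messages
    .apply(Wait.on(done))                                // hold elements until 'done' fires
    .apply(PubsubIO.writeMessages().to("projects/p/topics/t"));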

Re: bigquery issue

2018-01-18 Thread Chaim Turkel
I must say I am disappointed in the error messages. When I run the same code
in a different project it works, so it must be some limitation
that I don't know about.
chaim

On Tue, Jan 16, 2018 at 8:22 PM, Lukasz Cwik <lc...@google.com> wrote:
> Look at the worker logs. This page shows how to log information and how to
> find what was logged:
> https://cloud.google.com/dataflow/pipelines/logging#cloud-dataflow-worker-log-example
> The worker logs contain a lot of information written by Dataflow and also by
> your code. Note that you may need to change log levels to get enough
> information:
> https://cloud.google.com/dataflow/pipelines/logging#SettingLevels
>
> Also good to take a look at this generic troubleshooting information:
> https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline
>
> On Mon, Jan 15, 2018 at 12:18 AM, Chaim Turkel <ch...@behalf.com> wrote:
>>
>> Hi,
>>   I have a fairly simple pipeline that create daily snapshots of my
>> data, and it sometimes fails, but the reason is not obvious:
>>
>>
>> (863777e448a29a5c): Workflow failed. Causes: (863777e448a298ff):
>>
>> S41:Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Read+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/GroupByWindow+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/ExpandIterable+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables)+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/extract
>> table name
>> +Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/count/GroupByKey+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/count/Combine.GroupedValues/Partial+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/count/GroupByKey/Reify+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/count/GroupByKey/Write+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/WithKeys/AddKeys/Map+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/Window.Into()/Window.Assign+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/GroupByKey/Reify+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/GroupByKey/Write
>> failed., (a412b0e93a586d57): A work item was attempted 4 times without
>> success. Each time the worker eventually lost contact with the
>> service. The work item was attempted on:
>> dailysnapshotoptions-chai-01142358-ef6c-harness-qw6s,
>> dailysnapshotoptions-chai-01142358-ef6c-harness-j09b,
>> dailysnapshotoptions-chai-01142358-ef6c-harness-3t9m,
>> dailysnapshotoptions-chai-01142358-ef6c-harness-372t
>>
>> is there any way to get more information?
>>
>> the job is:
>>
>>
>>
>> https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-01-14_23_58_54-6125672650598375925?project=ordinal-ember-163410=782381653268
>>
>>
>> chaim
>>
>
>



bigquery issue

2018-01-15 Thread Chaim Turkel
Hi,
  I have a fairly simple pipeline that creates daily snapshots of my
data, and it sometimes fails, but the reason is not obvious:


(863777e448a29a5c): Workflow failed. Causes: (863777e448a298ff):
S41:Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Read+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/GroupByWindow+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/ExpandIterable+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables)+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/extract
table name 
+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/count/GroupByKey+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/count/Combine.GroupedValues/Partial+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/count/GroupByKey/Reify+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables).WriteTablesMainOutput/count/GroupByKey/Write+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/WithKeys/AddKeys/Map+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/Window.Into()/Window.Assign+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/GroupByKey/Reify+Account_audit/BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/GroupByKey/Write
failed., (a412b0e93a586d57): A work item was attempted 4 times without
success. Each time the worker eventually lost contact with the
service. The work item was attempted on:
dailysnapshotoptions-chai-01142358-ef6c-harness-qw6s,
dailysnapshotoptions-chai-01142358-ef6c-harness-j09b,
dailysnapshotoptions-chai-01142358-ef6c-harness-3t9m,
dailysnapshotoptions-chai-01142358-ef6c-harness-372t

is there any way to get more information?

the job is:


https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-01-14_23_58_54-6125672650598375925?project=ordinal-ember-163410=782381653268


chaim



Re: read source - MongoDbIO.read()

2017-10-30 Thread Chaim Turkel
can you send me a link to the code?

On Mon, Oct 30, 2017 at 2:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> That's the evolution I'm proposing and I already implemented in some IO: 
> readAll pattern. Let me check for mongo.
>
> On Oct 30, 2017, 12:00, at 12:00, Chaim Turkel <ch...@behalf.com> wrote:
>>I am syncing multiple tables from mongo to bigquery.
>>So i first check how many records there are, and then if there are
>>records a need to sync them, else i need to update the status table,
>>that there was nothing to sync. Also in the case that I do sync i need
>>to update the status table with information about the sync.
>>
>>Why can't the read start from a collection also?
>>chaim
>>
>>On Mon, Oct 30, 2017 at 12:13 PM, Jean-Baptiste Onofré
>><j...@nanthrax.net> wrote:
>>> Can you describe your use case ? We can imagine to be able to define
>>a custom FN in the read. But I'm afraid it would be too specific.
>>>
>>> On Oct 30, 2017, 10:46, at 10:46, Chaim Turkel <ch...@behalf.com>
>>wrote:
>>>>any reason for this, there should be a way to run it from any point
>>>>
>>>>On Mon, Oct 30, 2017 at 11:24 AM, Jean-Baptiste Onofré
>>>><j...@nanthrax.net> wrote:
>>>>> Hi
>>>>>
>>>>> No the pipeline starts with the read. You can always create your
>>own
>>>>custom read.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Oct 30, 2017, 09:33, at 09:33, Chaim Turkel <ch...@behalf.com>
>>>>wrote:
>>>>>>Hi,
>>>>>>   Is there a way to have some code run before the read?
>>>>>>I would like to check before how many records exists and based on
>>>>this
>>>>>>have two different pipelines.
>>>>>>Currently this code is in the runner but since i have 20 tables
>>this
>>>>>>takes a long time.
>>>>>>I would like to move the check into the pipeline -
>>>>>>
>>>>>>any ideas?
>>>>>>
>>>>>>
>>>>>>chaim


Re: read source - MongoDbIO.read()

2017-10-30 Thread Chaim Turkel
Thanks, and for BigQuery too.

On Mon, Oct 30, 2017 at 2:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> That's the evolution I'm proposing and I already implemented in some IO: 
> readAll pattern. Let me check for mongo.
>
> On Oct 30, 2017, 12:00, at 12:00, Chaim Turkel <ch...@behalf.com> wrote:
>>I am syncing multiple tables from mongo to bigquery.
>>So i first check how many records there are, and then if there are
>>records a need to sync them, else i need to update the status table,
>>that there was nothing to sync. Also in the case that I do sync i need
>>to update the status table with information about the sync.
>>
>>Why can't the read start from a collection also?
>>chaim
>>
>>On Mon, Oct 30, 2017 at 12:13 PM, Jean-Baptiste Onofré
>><j...@nanthrax.net> wrote:
>>> Can you describe your use case ? We can imagine to be able to define
>>a custom FN in the read. But I'm afraid it would be too specific.
>>>
>>> On Oct 30, 2017, 10:46, at 10:46, Chaim Turkel <ch...@behalf.com>
>>wrote:
>>>>any reason for this, there should be a way to run it from any point
>>>>
>>>>On Mon, Oct 30, 2017 at 11:24 AM, Jean-Baptiste Onofré
>>>><j...@nanthrax.net> wrote:
>>>>> Hi
>>>>>
>>>>> No the pipeline starts with the read. You can always create your
>>own
>>>>custom read.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Oct 30, 2017, 09:33, at 09:33, Chaim Turkel <ch...@behalf.com>
>>>>wrote:
>>>>>>Hi,
>>>>>>   Is there a way to have some code run before the read?
>>>>>>I would like to check before how many records exists and based on
>>>>this
>>>>>>have two different pipelines.
>>>>>>Currently this code is in the runner but since i have 20 tables
>>this
>>>>>>takes a long time.
>>>>>>I would like to move the check into the pipeline -
>>>>>>
>>>>>>any ideas?
>>>>>>
>>>>>>
>>>>>>chaim


Re: read source - MongoDbIO.read()

2017-10-30 Thread Chaim Turkel
I am syncing multiple tables from Mongo to BigQuery.
I first check how many records there are; if there are
records, I need to sync them, otherwise I need to update the status table
to note that there was nothing to sync. Also, when I do sync, I need
to update the status table with information about the sync.

Why can't the read start from a collection also?
chaim

On Mon, Oct 30, 2017 at 12:13 PM, Jean-Baptiste Onofré <j...@nanthrax.net> 
wrote:
> Can you describe your use case ? We can imagine to be able to define a custom 
> FN in the read. But I'm afraid it would be too specific.
>
> On Oct 30, 2017, 10:46, at 10:46, Chaim Turkel <ch...@behalf.com> wrote:
>>any reason for this, there should be a way to run it from any point
>>
>>On Mon, Oct 30, 2017 at 11:24 AM, Jean-Baptiste Onofré
>><j...@nanthrax.net> wrote:
>>> Hi
>>>
>>> No the pipeline starts with the read. You can always create your own
>>custom read.
>>>
>>> Regards
>>> JB
>>>
>>> On Oct 30, 2017, 09:33, at 09:33, Chaim Turkel <ch...@behalf.com>
>>wrote:
>>>>Hi,
>>>>   Is there a way to have some code run before the read?
>>>>I would like to check before how many records exists and based on
>>this
>>>>have two different pipelines.
>>>>Currently this code is in the runner but since i have 20 tables this
>>>>takes a long time.
>>>>I would like to move the check into the pipeline -
>>>>
>>>>any ideas?
>>>>
>>>>
>>>>chaim


Re: read source - MongoDbIO.read()

2017-10-30 Thread Chaim Turkel
Any reason for this? There should be a way to run it from any point.

On Mon, Oct 30, 2017 at 11:24 AM, Jean-Baptiste Onofré <j...@nanthrax.net> 
wrote:
> Hi
>
> No the pipeline starts with the read. You can always create your own custom 
> read.
>
> Regards
> JB
>
> On Oct 30, 2017, 09:33, at 09:33, Chaim Turkel <ch...@behalf.com> wrote:
>>Hi,
>>   Is there a way to have some code run before the read?
>>I would like to check before how many records exists and based on this
>>have two different pipelines.
>>Currently this code is in the runner but since i have 20 tables this
>>takes a long time.
>>I would like to move the check into the pipeline -
>>
>>any ideas?
>>
>>
>>chaim


pipeline distribution

2017-10-30 Thread Chaim Turkel
Hi,
  I have a pipeline that has more than 20 collections. It seems that
Dataflow cannot deploy this pipeline.
I see that from the code I can create more than one pipeline.

Anyone know what the limit is?
Also, if I split it, is there a recommended way to do so (the
collections have different amounts of data)?

chaim


read source - MongoDbIO.read()

2017-10-30 Thread Chaim Turkel
Hi,
   Is there a way to have some code run before the read?
I would like to check beforehand how many records exist and, based on this,
choose between two different pipelines.
Currently this check is done in the driver program, but since I have 20 tables it
takes a long time.
I would like to move the check into the pipeline.

any ideas?


chaim


Re: UnconsumedReads

2017-10-27 Thread Chaim Turkel
I saw it in the Dataflow GUI.
Currently I emulate the code to catch the finish of the BigQuery write, since it
is not externalized.

On Wed, 25 Oct 2017, 19:46 Griselda Cuevas, <g...@google.com.invalid> wrote:

> Hi Chaim - could you elaborate a little more on how you're using or want to
> use the UnconsumedReads so we can help?
>
>
>
> Gris Cuevas Zambrano
>
> g...@google.com
>
> Open Source Strategy
>
> 345 Spear Street, San Francisco, 94105
> <https://maps.google.com/?q=345+Spear+Street,+San+Francisco,+94105=gmail=g>
>
>
>
> On 17 October 2017 at 06:03, Chaim Turkel <ch...@behalf.com> wrote:
>
> > what is the purpose of this phase?
> >
> > chaim
> >
>


Re: MongoDbIO

2017-10-17 Thread Chaim Turkel
I'd be glad to go over the Jira to make sure that we understand each other.
chaim

On Tue, Oct 17, 2017 at 5:00 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Let me take a quick look and create a Jira if required.
>
> Thanks for the idea !
>
> Regards
> JB
>
>
> On 10/17/2017 01:54 PM, Chaim Turkel wrote:
>>
>> I am fine if you can show me how to do the Create.of() on the
>> MongoDbIO.read().
>> It would be nice also to have the status also on the MongoDbIO.write.
>>
>> again this is the reactive streams pattern that there always is a
>> complete or error path
>>
>> chaim
>>
>> On Tue, Oct 17, 2017 at 2:51 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>>
>>> Good point but I was thinking more about a PTransform with Create.of() if
>>> there's no data.
>>>
>>> Anyway, I can add a DoFn support in the write to update a status (not
>>> sure
>>> if it really makes sense).
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 10/17/2017 01:40 PM, Chaim Turkel wrote:
>>>>
>>>>
>>>> but if there is no data then
>>>> .apply(ParDo.of(new DoFn() { // check PCollection and set the status }))
>>>>
>>>> will not be called
>>>>
>>>> On Tue, Oct 17, 2017 at 8:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>>
>>>>>
>>>>> I didn't mean on the read, I meant between the read and write.
>>>>>
>>>>> Basically, your pipeline could look like:
>>>>>
>>>>> pipeline.apply(MongoDbIO.read()...)
>>>>>   .apply(ParDo.of(new DoFn() { // check PCollection and set the
>>>>> status
>>>>> }))
>>>>>   .apply(MongoDbIO.write()...)
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>>
>>>>> On 10/16/2017 09:42 PM, Chaim Turkel wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> how to i add a ParDo on the MongoDbIO.read() if there are no records
>>>>>> read?
>>>>>>
>>>>>> On Mon, Oct 16, 2017 at 6:53 PM, Jean-Baptiste Onofré
>>>>>> <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> You can always add your own ParDo(DoFn) where you write the status.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 10/16/2017 04:24 PM, Chaim Turkel wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> that is the problem, i want to write a status that i tried and that
>>>>>>>> there were no records
>>>>>>>>
>>>>>>>> On Mon, Oct 16, 2017 at 3:51 PM, Jean-Baptiste Onofré
>>>>>>>> <j...@nanthrax.net>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Chaim,
>>>>>>>>>
>>>>>>>>> So, you mean you call MongoDBIO.write() with an empty PCollection
>>>>>>>>> (no
>>>>>>>>> element in the collection) ?
>>>>>>>>>
>>>>>>>>> The write is basically a DoFn so, it won't do anything if the
>>>>>>>>> PCollection
>>>>>>>>> is
>>>>>>>>> empty.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/16/2017 01:59 PM, Chaim Turkel wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>In the case that there are no records to read, is there a
>>>>>>>>>> way
>>>>>>>>>> to
>>>>>>>>>> get
>>>>>>>>>> called so that i can write the status?
>>>>>>>>>>
>>>>>>>>>> chaim
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>> jbono...@apache.org
>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>> Talend - http://www.talend.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbono...@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbono...@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>
>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
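
One way to get a "nothing to sync" status written even when the read produces no documents (a sketch, not part of MongoDbIO itself): in the default global window, Count.globally() emits exactly one element, 0 for an empty bounded input, so a DoFn downstream of it is guaranteed to run. The connection details and writeStatus() are placeholders:

PCollection<Document> docs = pipeline.apply(MongoDbIO.read()
    .withUri("mongodb://host:27017")
    .withDatabase("my_db")
    .withCollection("my_collection"));

docs.apply(Count.globally())                  // exactly one Long; 0 if nothing was read
    .apply("WriteSyncStatus", ParDo.of(new DoFn<Long, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // writeStatus() is a placeholder for however the status table gets updated
        writeStatus(c.element() == 0 ? "NOTHING_TO_SYNC" : "SYNCED", c.element());
      }
    }));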


UnconsumedReads

2017-10-17 Thread Chaim Turkel
what is the purpose of this phase?

chaim


Re: MongoDbIO

2017-10-17 Thread Chaim Turkel
I am fine with that if you can show me how to do the Create.of() with MongoDbIO.read().
It would also be nice to have the status on MongoDbIO.write.

Again, this is the reactive streams pattern, where there is always a
complete or error path.

chaim

On Tue, Oct 17, 2017 at 2:51 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Good point but I was thinking more about a PTransform with Create.of() if
> there's no data.
>
> Anyway, I can add a DoFn support in the write to update a status (not sure
> if it really makes sense).
>
> Regards
> JB
>
>
> On 10/17/2017 01:40 PM, Chaim Turkel wrote:
>>
>> but if there is no data then
>> .apply(ParDo.of(new DoFn() { // check PCollection and set the status }))
>>
>> will not be called
>>
>> On Tue, Oct 17, 2017 at 8:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>>
>>> I didn't mean on the read, I meant between the read and write.
>>>
>>> Basically, your pipeline could look like:
>>>
>>> pipeline.apply(MongoDbIO.read()...)
>>>  .apply(ParDo.of(new DoFn() { // check PCollection and set the
>>> status
>>> }))
>>>  .apply(MongoDbIO.write()...)
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 10/16/2017 09:42 PM, Chaim Turkel wrote:
>>>>
>>>>
>>>> how to i add a ParDo on the MongoDbIO.read() if there are no records
>>>> read?
>>>>
>>>> On Mon, Oct 16, 2017 at 6:53 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>>
>>>>>
>>>>> You can always add your own ParDo(DoFn) where you write the status.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>>
>>>>> On 10/16/2017 04:24 PM, Chaim Turkel wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> that is the problem, i want to write a status that i tried and that
>>>>>> there were no records
>>>>>>
>>>>>> On Mon, Oct 16, 2017 at 3:51 PM, Jean-Baptiste Onofré
>>>>>> <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Chaim,
>>>>>>>
>>>>>>> So, you mean you call MongoDBIO.write() with an empty PCollection (no
>>>>>>> element in the collection) ?
>>>>>>>
>>>>>>> The write is basically a DoFn so, it won't do anything if the
>>>>>>> PCollection
>>>>>>> is
>>>>>>> empty.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 10/16/2017 01:59 PM, Chaim Turkel wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>   In the case that there are no records to read, is there a way
>>>>>>>> to
>>>>>>>> get
>>>>>>>> called so that i can write the status?
>>>>>>>>
>>>>>>>> chaim
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbono...@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbono...@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>
>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: MongoDbIO

2017-10-16 Thread Chaim Turkel
How do I add a ParDo on the MongoDbIO.read() if there are no records read?

On Mon, Oct 16, 2017 at 6:53 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> You can always add your own ParDo(DoFn) where you write the status.
>
> Regards
> JB
>
>
> On 10/16/2017 04:24 PM, Chaim Turkel wrote:
>>
>> that is the problem, i want to write a status that i tried and that
>> there were no records
>>
>> On Mon, Oct 16, 2017 at 3:51 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>>
>>> Hi Chaim,
>>>
>>> So, you mean you call MongoDBIO.write() with an empty PCollection (no
>>> element in the collection) ?
>>>
>>> The write is basically a DoFn so, it won't do anything if the PCollection
>>> is
>>> empty.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 10/16/2017 01:59 PM, Chaim Turkel wrote:
>>>>
>>>>
>>>> Hi,
>>>> In the case that there are no records to read, is there a way to get
>>>> called so that i can write the status?
>>>>
>>>> chaim
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: MongoDbIO

2017-10-16 Thread Chaim Turkel
That is the problem: I want to write a status saying that I tried and that
there were no records.

On Mon, Oct 16, 2017 at 3:51 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Chaim,
>
> So, you mean you call MongoDBIO.write() with an empty PCollection (no
> element in the collection) ?
>
> The write is basically a DoFn so, it won't do anything if the PCollection is
> empty.
>
> Regards
> JB
>
>
> On 10/16/2017 01:59 PM, Chaim Turkel wrote:
>>
>> Hi,
>>In the case that there are no records to read, is there a way to get
>> called so that i can write the status?
>>
>> chaim
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


MongoDbIO

2017-10-16 Thread Chaim Turkel
Hi,
  In the case that there are no records to read, is there a way to get
called so that I can write the status?

chaim


reactive streams

2017-10-16 Thread Chaim Turkel
Hi,
   I think Beam would benefit if it were to adopt the Reactive Streams API.

This would mean that all sources (which currently only have onNext) would add
onComplete and onError, and the same for sinks.

This would make it much easier to add statuses and to handle empty
sets and errors in data sets.


chaim


Re: BigQueryIO Partitions

2017-09-28 Thread Chaim Turkel
you can see my job at:
https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-09-26_03_17_44-4821512213867199289?project=ordinal-ember-163410


On Wed, Sep 27, 2017 at 10:47 PM, Reuven Lax <re...@google.com.invalid> wrote:
> There are a couple of options, and if you provide a job id (since you are
> using the Dataflow runner) we can better advise.
>
> If you are writing to more than 2000 partitions, this won't work - BigQuery
> has a hard quota of 1000 partition updates per table per day.
>
> If you have fewer than 1000 jobs, there are a few possibilities. It's
> possible that BigQuery is taking a while to schedule some of those jobs;
> they'll end up sitting in a queue waiting to be scheduled. We can look at
> one of the jobs in detail to see if that's happening. Eugene's suggestion
> of using your pipeline to load into a single table might be the best one.
> You can write the date into a separate column, and then write a shell
> script to copy each date to it's own partition (see
> https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
> for some examples).
>
> On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
>> I see. Then Reuven's answer above applies.
>> Maybe you could write to a non-partitioned table, and then split it into
>> smaller partitioned tables. See https://stackoverflow.com/a/
>> 39001706/278042
>> <https://stackoverflow.com/a/39001706/278042ащк> for a discussion of the
>> current options - granted, it seems like there currently don't exist very
>> good options for creating a very large number of table partitions from
>> existing data.
>>
>> On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <ch...@behalf.com> wrote:
>>
>> > thank you for your detailed response.
>> > Currently i am a bit stuck.
>> > I need to migrate data from mongo to bigquery, we have about 1 terra
>> > of data. It is history data, so i want to use bigquery partitions.
>> > It seems that the io connector creates a job per partition so it takes
>> > a very long time, and i hit the quota in bigquery of the amount of
>> > jobs per day.
>> > I would like to use streaming but you cannot stream old data more than 30
>> > day
>> >
>> > So I thought of partitions to see if i can do more parraleism
>> >
>> > chaim
>> >
>> >
>> > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
>> > <kirpic...@google.com.invalid> wrote:
>> > > Okay, I see - there's about 3 different meanings of the word
>> "partition"
>> > > that could have been involved here (BigQuery partitions,
>> runner-specific
>> > > bundles, and the Partition transform), hence my request for
>> > clarification.
>> > >
>> > > If you mean the Partition transform - then I'm confused what do you
>> mean
>> > by
>> > > BigQueryIO "supporting" it? The Partition transform takes a PCollection
>> > and
>> > > produces a bunch of PCollections; these are ordinary PCollection's and
>> > you
>> > > can apply any Beam transforms to them, and BigQueryIO.write() is no
>> > > exception to this - you can apply it too.
>> > >
>> > > To answer whether using Partition would improve your performance, I'd
>> > need
>> > > to understand exactly what you're comparing against what. I suppose
>> > you're
>> > > comparing the following:
>> > > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
>> > table
>> > > 2) Splitting a PCollection into several smaller PCollection's using
>> > > Partition, and applying BigQueryIO.write() to each of them, writing to
>> > > different tables I suppose? (or do you want to write to different
>> > BigQuery
>> > > partitions of the same table using a table partition decorator?)
>> > > I would expect #2 to perform strictly worse than #1, because it writes
>> > the
>> > > same amount of data but increases the number of BigQuery load jobs
>> > involved
>> > > (thus increases per-job overhead and consumes BigQuery quota).
>> > >
>> > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <ch...@behalf.com>
>> wrote:
>> > >
>> > >> https://beam.apache.org/documentation/programming-guide/#partition
>> > >>
>> > >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
>> > >> <kir

Re: BigQueryIO Partitions

2017-09-27 Thread Chaim Turkel
Thank you for your detailed response.
Currently I am a bit stuck.
I need to migrate data from Mongo to BigQuery; we have about 1 TB
of data. It is historical data, so I want to use BigQuery partitions.
It seems that the IO connector creates a load job per partition, so it takes
a very long time, and I hit the BigQuery quota on the number of
jobs per day.
I would like to use streaming, but you cannot stream data older than 30 days.

So I thought of partitions to see if I can get more parallelism.

chaim


On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Okay, I see - there's about 3 different meanings of the word "partition"
> that could have been involved here (BigQuery partitions, runner-specific
> bundles, and the Partition transform), hence my request for clarification.
>
> If you mean the Partition transform - then I'm confused what do you mean by
> BigQueryIO "supporting" it? The Partition transform takes a PCollection and
> produces a bunch of PCollections; these are ordinary PCollection's and you
> can apply any Beam transforms to them, and BigQueryIO.write() is no
> exception to this - you can apply it too.
>
> To answer whether using Partition would improve your performance, I'd need
> to understand exactly what you're comparing against what. I suppose you're
> comparing the following:
> 1) Applying BigQueryIO.write() to a PCollection, writing to a single table
> 2) Splitting a PCollection into several smaller PCollection's using
> Partition, and applying BigQueryIO.write() to each of them, writing to
> different tables I suppose? (or do you want to write to different BigQuery
> partitions of the same table using a table partition decorator?)
> I would expect #2 to perform strictly worse than #1, because it writes the
> same amount of data but increases the number of BigQuery load jobs involved
> (thus increases per-job overhead and consumes BigQuery quota).
>
> On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> https://beam.apache.org/documentation/programming-guide/#partition
>>
>> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
>> <kirpic...@google.com.invalid> wrote:
>> > What do you mean by Beam partitions?
>> >
>> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> by the way currently the performance on bigquery partitions is very bad.
>> >> Is there a repository where i can test with 2.2.0?
>> >>
>> >> chaim
>> >>
>> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid>
>> >> wrote:
>> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
>> >> table
>> >> > containing the partitions is not pre created (fixed in 2.2.0).
>> >> >
>> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >>Does BigQueryIO support Partitions when writing? will it improve
>> my
>> >> >> performance?
>> >> >>
>> >> >>
>> >> >> chaim
>> >> >>
>> >>
>>
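
To make the terminology concrete, the Beam Partition transform discussed here fans a single PCollection out into N PCollections, each of which can then get its own BigQueryIO.write(). A small sketch (partitionKey() and the table names are placeholders):

int numPartitions = 4;
PCollectionList<TableRow> parts = rows.apply(
    Partition.of(numPartitions, new Partition.PartitionFn<TableRow>() {
      @Override
      public int partitionFor(TableRow row, int n) {
        // placeholder: route each row by some key, e.g. a date column
        return Math.abs(partitionKey(row).hashCode()) % n;
      }
    }));

for (int i = 0; i < parts.size(); i++) {
  parts.get(i).apply("Write-" + i, BigQueryIO.writeTableRows()
      .to("my_project:my_dataset.table_" + i));     // placeholder table names
}

As Eugene notes, splitting like this mainly multiplies the number of load jobs, so it is illustrative rather than a recommendation.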


Re: BigQueryIO Partitions

2017-09-26 Thread Chaim Turkel
By the way, currently the performance of BigQuery partitions is very bad.
Is there a repository where I can test with 2.2.0?

chaim

On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the table
> containing the partitions is not pre created (fixed in 2.2.0).
>
> On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> Hi,
>>
>>Does BigQueryIO support Partitions when writing? will it improve my
>> performance?
>>
>>
>> chaim
>>


Pipeline performance

2017-09-26 Thread Chaim Turkel
Hi,
  I am transforming multiple tables (about 20) from Mongo to BigQuery.
Currently I have one pipeline, with a collection for each table. Is there a
limit on how many collections I can have?
Would it be better to create multiple pipelines?


chaim


Re: BigQueryIO Partitions

2017-09-26 Thread Chaim Turkel
No, I meant Beam partitions.

On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the table
> containing the partitions is not pre created (fixed in 2.2.0).
>
> On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> Hi,
>>
>>Does BigQueryIO support Partitions when writing? will it improve my
>> performance?
>>
>>
>> chaim
>>


BigQueryIO Partitions

2017-09-26 Thread Chaim Turkel
Hi,

   Does BigQueryIO support Partitions when writing? Will it improve my
performance?


chaim


Re: multiple PCollections

2017-09-17 Thread Chaim Turkel
The exceptions could be from bad data (I am working on that) or from
an exceeded quota.
The problem is that if I have 2 collections in the pipeline, and one
fails on the quota, the other will fail also, even though it should
have succeeded.

On Sat, Sep 16, 2017 at 10:35 PM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> There is no way to do catch an exception inside a transform unless you
> wrote the transform yourself and have control over the code of its DoFn's.
> That's why I'm asking whether configuring bad records would be an
> acceptable workaround.
>
> On Sat, Sep 16, 2017, 11:07 AM Chaim Turkel <ch...@behalf.com> wrote:
>
>> i am using batch, since streaming cannot be done with partitions with
>> old data more than 30 days.
>> the question is how can i catch the exception in the pipline so that
>> other collections do not fail
>>
>> On Fri, Sep 15, 2017 at 7:37 PM, Eugene Kirpichov
>> <kirpic...@google.com.invalid> wrote:
>> > Are you using streaming inserts or batch loads method for writing?
>> > If it's streaming inserts, BigQueryIO already can return the bad records,
>> > and I believe it won't fail the pipeline, so I'm assuming it's batch
>> loads.
>> > For batch loads, would it be sufficient for your purposes if
>> > BigQueryIO.read() let you configure the configuration.load.maxBadRecords
>> > parameter (see
>> https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs
>> > )?
>> >
>> > On Thu, Sep 14, 2017 at 10:29 PM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> I am using the sink of BigQueryIO so the example is not the same. The
>> >> example is bad data from reading, I have problems when writting. There
>> >> can be multiple errors when writing to BigQuery, and if it fails there
>> >> is no way to catch this error, and the whole pipeline fails
>> >>
>> >> chaim
>> >>
>> >> On Thu, Sep 14, 2017 at 5:48 PM, Reuven Lax <re...@google.com.invalid>
>> >> wrote:
>> >> > What sort of error? You can always put a try/catch inside your DoFns
>> to
>> >> > catch the majority of errors. A common pattern is to save records that
>> >> > caused exceptions out to a separate output so you can debug them. This
>> >> blog
>> >> > post
>> >> > <
>> >>
>> https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow
>> >> >
>> >> > explained
>> >> > the pattern.
>> >> >
>> >> > Reuven
>> >> >
>> >> > On Thu, Sep 14, 2017 at 1:43 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >>   In one pipeline I have multiple PCollections. If I have an error on
>> >> >> one then the whole pipline is canceled, is there a way to catch the
>> >> >> error and log it, and for all other PCollections to continue?
>> >> >>
>> >> >>
>> >> >> chaim
>> >> >>
>> >>
>>


Re: multiple PCollections

2017-09-16 Thread Chaim Turkel
I am using batch, since streaming cannot be done into partitions for
data older than 30 days.
The question is how I can catch the exception in the pipeline so that
the other collections do not fail.

On Fri, Sep 15, 2017 at 7:37 PM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Are you using streaming inserts or batch loads method for writing?
> If it's streaming inserts, BigQueryIO already can return the bad records,
> and I believe it won't fail the pipeline, so I'm assuming it's batch loads.
> For batch loads, would it be sufficient for your purposes if
> BigQueryIO.read() let you configure the configuration.load.maxBadRecords
> parameter (see https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs
> )?
>
> On Thu, Sep 14, 2017 at 10:29 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> I am using the sink of BigQueryIO so the example is not the same. The
>> example is bad data from reading, I have problems when writting. There
>> can be multiple errors when writing to BigQuery, and if it fails there
>> is no way to catch this error, and the whole pipeline fails
>>
>> chaim
>>
>> On Thu, Sep 14, 2017 at 5:48 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > What sort of error? You can always put a try/catch inside your DoFns to
>> > catch the majority of errors. A common pattern is to save records that
>> > caused exceptions out to a separate output so you can debug them. This
>> blog
>> > post
>> > <
>> https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow
>> >
>> > explained
>> > the pattern.
>> >
>> > Reuven
>> >
>> > On Thu, Sep 14, 2017 at 1:43 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> Hi,
>> >>
>> >>   In one pipeline I have multiple PCollections. If I have an error on
>> >> one then the whole pipline is canceled, is there a way to catch the
>> >> error and log it, and for all other PCollections to continue?
>> >>
>> >>
>> >> chaim
>> >>
>>


Re: multiple PCollections

2017-09-14 Thread Chaim Turkel
I am using the BigQueryIO sink, so the example is not the same. The
example covers bad data when reading; I have problems when writing. There
can be multiple errors when writing to BigQuery, and if it fails there
is no way to catch the error, and the whole pipeline fails.

chaim

On Thu, Sep 14, 2017 at 5:48 PM, Reuven Lax <re...@google.com.invalid> wrote:
> What sort of error? You can always put a try/catch inside your DoFns to
> catch the majority of errors. A common pattern is to save records that
> caused exceptions out to a separate output so you can debug them. This blog
> post
> <https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow>
> explained
> the pattern.
>
> Reuven
>
> On Thu, Sep 14, 2017 at 1:43 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> Hi,
>>
>>   In one pipeline I have multiple PCollections. If I have an error on
>> one then the whole pipline is canceled, is there a way to catch the
>> error and log it, and for all other PCollections to continue?
>>
>>
>> chaim
>>
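
A sketch of the dead-letter pattern Reuven refers to, so that a bad record in one collection does not take down the others. The tags are illustrative names, and only exceptions thrown inside your own DoFn can be caught this way; failures inside BigQuery's own load jobs cannot:

final TupleTag<TableRow> goodTag = new TupleTag<TableRow>() {};
final TupleTag<TableRow> badTag  = new TupleTag<TableRow>() {};

PCollectionTuple tagged = rows.apply("CatchErrors",
    ParDo.of(new DoFn<TableRow, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        try {
          // do the per-record work that may throw (parsing, conversion, ...)
          c.output(c.element());
        } catch (Exception e) {
          // route the failing record to the side output instead of failing the bundle
          c.output(badTag, c.element());
        }
      }
    }).withOutputTags(goodTag, TupleTagList.of(badTag)));

tagged.get(goodTag).apply(BigQueryIO.writeTableRows()
    .to("my_project:my_dataset.my_table"));              // placeholder destination
// tagged.get(badTag) -> write somewhere inspectable (GCS, a dead-letter table, ...)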


multiple PCollections

2017-09-14 Thread Chaim Turkel
Hi,

  In one pipeline I have multiple PCollections. If I have an error on
one, the whole pipeline is canceled. Is there a way to catch the
error and log it, and for all other PCollections to continue?


chaim


Re: PBegin, PDone

2017-09-14 Thread Chaim Turkel
My use case is that I have generic code to transfer, for example, tables
from Mongo to BigQuery. I iterate over all tables in Mongo and create
a PCollection for each. But there are things that need to be checked
before running, and each transfer should run only if it validates.
I tried the pipeline visitor, but there is no way to stop a PCollection from running.

It would be nice to have hooks so that at run time (not graph-construction time) I
can decide that a PBegin should not start.

chaim

On Thu, Sep 14, 2017 at 9:25 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi,
>
> I don't think it makes sense on a transform (as it expects a PCollection).
> However, why not introducing a specific hook for that.
>
> I think you can workaround using a Pipeline Visitor, but it would be runner
> level.
>
> Regards
> JB
>
>
> On 09/14/2017 08:21 AM, Chaim Turkel wrote:
>>
>> Hi,
>>I have a few scenarios where I would like to have code that is
>> before the PBegin and after the PDone.
>> This is usually for monitoring purposes.
>> It would be nice to be able to transform from PBegin to PBegin, and
>> PDone to PDone, so that code can be run before and after and not in
>> the driver program
>>
>>
>> chaim
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


PBegin, PDone

2017-09-14 Thread Chaim Turkel
Hi,
  I have a few scenarios where I would like to have code that runs
before the PBegin and after the PDone.
This is usually for monitoring purposes.
It would be nice to be able to transform from PBegin to PBegin, and
PDone to PDone, so that code can run before and after, and not in
the driver program.


chaim


Re: BigQueryIO withSchemaFromView

2017-09-13 Thread Chaim Turkel
Just found that with BigQuery you cannot stream to partitions older
than 30 days (so I can't use it anyway to load old data) :(

On Wed, Sep 13, 2017 at 7:08 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> Support was added to expose how users want to load their data with
> https://github.com/apache/beam/commit/075d4d45a9cd398f3b4023b6efd495cc58eb9bdd
> It is planned to be released in 2.2.0
>
> On Tue, Sep 12, 2017 at 11:48 PM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> from what i found it I have the windowing with bigquery partition (per
>> day - 1545 partitions) the insert can take 5 hours, where if there is
>> no partitions then it takes about 12 minutes
>>
>> I have 13,843,080 recrods 6.76 GB.
>> Any ideas how to get the partition to work faster.
>>
>> Is there a way to get the BigQueryIO to use streaming and not jobs?
>>
>> chaim
>>
>> On Tue, Sep 12, 2017 at 11:32 PM, Chaim Turkel <ch...@behalf.com> wrote:
>> > i am using windowing for the partion of the table, maybe that has to do
>> with it?
>> >
>> > On Tue, Sep 12, 2017 at 11:25 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> >> Ok, something is going wrong then. It appears that your job created over
>> >> 14,000 BigQuery load jobs, which is not expected (and probably why
>> things
>> >> were so slow).
>> >>
>> >> On Tue, Sep 12, 2017 at 8:50 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >>
>> >>> no that specific job created only 2 tables
>> >>>
>> >>> On Tue, Sep 12, 2017 at 4:36 PM, Reuven Lax <re...@google.com.invalid>
>> >>> wrote:
>> >>> > It looks like your job is creating about 14,45 distinct BigQuery
>> tables.
>> >>> > Does that sound correct to you?
>> >>> >
>> >>> > Reuven
>> >>> >
>> >>> > On Tue, Sep 12, 2017 at 6:22 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >>> >
>> >>> >> the job id is 2017-09-12_02_57_55-5233544151932101752
>> >>> >> as you can see the majority of the time is inserting into bigquery.
>> >>> >> is there any way to parallel this?
>> >>> >>
>> >>> >> My feeling for the windowing is that writing should be done per
>> window
>> >>> >> (my window is daily) or at least to be able to configure it
>> >>> >>
>> >>> >> chaim
>> >>> >>
>> >>> >> On Tue, Sep 12, 2017 at 4:10 PM, Reuven Lax
>> <re...@google.com.invalid>
>> >>> >> wrote:
>> >>> >> > So the problem is you are running on Dataflow, and it's taking
>> longer
>> >>> >> than
>> >>> >> > you think it should? If you provide the Dataflow job id we can
>> help
>> >>> you
>> >>> >> > debug why it's taking 30 minutes. (and as an aside, if this turns
>> >>> into a
>> >>> >> > Dataflow debugging session we should move it off of the Beam list
>> and
>> >>> >> onto
>> >>> >> > a Dataflow-specific tread)
>> >>> >> >
>> >>> >> > Reuven
>> >>> >> >
>> >>> >> > On Tue, Sep 12, 2017 at 3:28 AM, Chaim Turkel <ch...@behalf.com>
>> >>> wrote:
>> >>> >> >
>> >>> >> >> is there a way around this, my time for 13gb is not close to 30
>> >>> >> >> minutes, while it should be around 15 minutes.
>> >>> >> >> Do i need to chunk the code myself to windows, and run in
>> parallel?
>> >>> >> >> chaim
>> >>> >> >>
>> >>> >> >> On Sun, Sep 10, 2017 at 6:32 PM, Reuven Lax
>> <re...@google.com.invalid
>> >>> >
>> >>> >> >> wrote:
>> >>> >> >> > In that case I can say unequivocally that Dataflow (in batch
>> mode)
>> >>> >> does
>> >>> >> >> not
>> >>> >> >> > produce results for a stage until it has processed that entire
>> >>> stage.
>> >>> >> The
>> >>> >> >> > reason for this is that the batch runner is optimized for
>> >
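
The change Lukasz links to adds a withMethod() switch to BigQueryIO.write(); once on 2.2.0+, forcing streaming inserts looks roughly like the sketch below (the table spec is a placeholder, and the 30-day limit on streaming into old partitions that Chaim mentions elsewhere in this thread still applies):

rows.apply(BigQueryIO.writeTableRows()
    .to("my_project:my_dataset.my_table$20170912")        // placeholder partition decorator
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));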

Re: BigQueryIO withSchemaFromView

2017-09-13 Thread Chaim Turkel
.apply("insert data table - " + table.getTableName(),
BigQueryIO.writeTableRows()
.to(TableRefPartition.perDay(options.getBQProject(),
options.getDatasetId(), table.getBqTableName()))
.withSchemaFromView(tableSchemas)

.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)


public class TableRefPartition implements
SerializableFunction<ValueInSingleWindow, TableDestination>
{

private final String projectId;
private final String datasetId;
private final String pattern;
private final String table;

public static TableRefPartition perDay(String projectId, String
datasetId, String tablePrefix) {
return new TableRefPartition(projectId, datasetId, "MMdd",
tablePrefix + "$");
}

private TableRefPartition(String projectId, String datasetId,
String pattern, String table) {
this.projectId = projectId;
this.datasetId = datasetId;
this.pattern = pattern;
this.table = table;
}

@Override
public TableDestination apply(ValueInSingleWindow input) {
DateTimeFormatter partition =
DateTimeFormat.forPattern(pattern).withZoneUTC();

TableReference reference = new TableReference();
reference.setProjectId(this.projectId);
reference.setDatasetId(this.datasetId);

reference.setTableId(table +
input.getWindow().maxTimestamp().toString(partition));
return new TableDestination(reference, null);
}
}
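
For reference, a minimal sketch, not from the original message, of what this function resolves to for a single daily window; the project, dataset, and table names are made up, and it assumes the yyyyMMdd pattern above:

// Illustrative only: shows the table spec TableRefPartition.perDay produces
// for a row that falls in the 2017-09-12 daily window.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.PaneInfo;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.Instant;

public class TableRefPartitionExample {
    public static void main(String[] args) {
        IntervalWindow day = new IntervalWindow(
                Instant.parse("2017-09-12T00:00:00Z"),
                Instant.parse("2017-09-13T00:00:00Z"));
        TableDestination destination = TableRefPartition
                .perDay("my-project", "my_dataset", "my_table")
                .apply(ValueInSingleWindow.of(
                        new TableRow(), day.maxTimestamp(), day, PaneInfo.NO_FIRING));
        // Prints: my-project:my_dataset.my_table$20170912
        System.out.println(destination.getTableSpec());
    }
}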

On Wed, Sep 13, 2017 at 9:40 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Can you show us some of the code you are using? How are you loading into
> separate partitions?
>
> Reuven
>
> On Wed, Sep 13, 2017 at 10:13 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>>  I am loading into separate partitions of the same table.
>> I want to see it streaming will be faster.
>>
>> Is there a repository where i can use the snapshot version?
>>
>>
>> On Wed, Sep 13, 2017 at 7:19 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > Ah, so you are loading each window into a separate BigQuery table? That
>> > might be the reason things are slow. Remembert a batch job doesn't return
>> > until everything finishes, and if you are loading that many tables it's
>> > entirely possible that BigQuery will throttle you, causing the slowdown.
>> >
>> > A couple of options:
>> >
>> > 1. Instead of loading into separate BigQuery tables, you could load into
>> > separate partitions of the same table. See this page for more info:
>> > https://cloud.google.com/bigquery/docs/partitioned-tables
>> >
>> > 2. If you have a streaming unbounded source for your data, you can run
>> > using a streaming runner. That will load each window as it becomes
>> > available instead of waiting for everything to load.
>> >
>> > Reuven
>> >
>> > On Tue, Sep 12, 2017 at 11:48 PM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> from what i found it I have the windowing with bigquery partition (per
>> >> day - 1545 partitions) the insert can take 5 hours, where if there is
>> >> no partitions then it takes about 12 minutes
>> >>
>> >> I have 13,843,080 recrods 6.76 GB.
>> >> Any ideas how to get the partition to work faster.
>> >>
>> >> Is there a way to get the BigQueryIO to use streaming and not jobs?
>> >>
>> >> chaim
>> >>
>> >> On Tue, Sep 12, 2017 at 11:32 PM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> > i am using windowing for the partion of the table, maybe that has to
>> do
>> >> with it?
>> >> >
>> >> > On Tue, Sep 12, 2017 at 11:25 PM, Reuven Lax <re...@google.com.invalid
>> >
>> >> wrote:
>> >> >> Ok, something is going wrong then. It appears that your job created
>> over
>> >> >> 14,000 BigQuery load jobs, which is not expected (and probably why
>> >> things
>> >> >> were so slow).
>> >> >>
>> >> >> On Tue, Sep 12, 2017 at 8:50 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >>
>> >> >>> no that specific job created only 2 tables
>> >> >>>
>> >> >>> On Tue, Sep 12, 2017 at 4:36 PM, Reuven Lax
>> <re...@google.com.invalid>
>> >> >>> wrote:
>> >> >>> > It looks like your job is creating about 14,45 distinct BigQuery
>> >> tables.
>> >> >>> &g

Re: BigQueryIO withSchemaFromView

2017-09-13 Thread Chaim Turkel
I just went over the changes for the streaming method.
That looks great.
How about adding the option to continue the apply chain after a successful
write, with statistics or something similar to what you get on failure?

On Wed, Sep 13, 2017 at 7:19 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Ah, so you are loading each window into a separate BigQuery table? That
> might be the reason things are slow. Remembert a batch job doesn't return
> until everything finishes, and if you are loading that many tables it's
> entirely possible that BigQuery will throttle you, causing the slowdown.
>
> A couple of options:
>
> 1. Instead of loading into separate BigQuery tables, you could load into
> separate partitions of the same table. See this page for more info:
> https://cloud.google.com/bigquery/docs/partitioned-tables
>
> 2. If you have a streaming unbounded source for your data, you can run
> using a streaming runner. That will load each window as it becomes
> available instead of waiting for everything to load.
>
> Reuven
>
> On Tue, Sep 12, 2017 at 11:48 PM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> from what i found it I have the windowing with bigquery partition (per
>> day - 1545 partitions) the insert can take 5 hours, where if there is
>> no partitions then it takes about 12 minutes
>>
>> I have 13,843,080 recrods 6.76 GB.
>> Any ideas how to get the partition to work faster.
>>
>> Is there a way to get the BigQueryIO to use streaming and not jobs?
>>
>> chaim
>>
>> On Tue, Sep 12, 2017 at 11:32 PM, Chaim Turkel <ch...@behalf.com> wrote:
>> > i am using windowing for the partion of the table, maybe that has to do
>> with it?
>> >
>> > On Tue, Sep 12, 2017 at 11:25 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> >> Ok, something is going wrong then. It appears that your job created over
>> >> 14,000 BigQuery load jobs, which is not expected (and probably why
>> things
>> >> were so slow).
>> >>
>> >> On Tue, Sep 12, 2017 at 8:50 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >>
>> >>> no that specific job created only 2 tables
>> >>>
>> >>> On Tue, Sep 12, 2017 at 4:36 PM, Reuven Lax <re...@google.com.invalid>
>> >>> wrote:
>> >>> > It looks like your job is creating about 14,45 distinct BigQuery
>> tables.
>> >>> > Does that sound correct to you?
>> >>> >
>> >>> > Reuven
>> >>> >
>> >>> > On Tue, Sep 12, 2017 at 6:22 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >>> >
>> >>> >> the job id is 2017-09-12_02_57_55-5233544151932101752
>> >>> >> as you can see the majority of the time is inserting into bigquery.
>> >>> >> is there any way to parallel this?
>> >>> >>
>> >>> >> My feeling for the windowing is that writing should be done per
>> window
>> >>> >> (my window is daily) or at least to be able to configure it
>> >>> >>
>> >>> >> chaim
>> >>> >>
>> >>> >> On Tue, Sep 12, 2017 at 4:10 PM, Reuven Lax
>> <re...@google.com.invalid>
>> >>> >> wrote:
>> >>> >> > So the problem is you are running on Dataflow, and it's taking
>> longer
>> >>> >> than
>> >>> >> > you think it should? If you provide the Dataflow job id we can
>> help
>> >>> you
>> >>> >> > debug why it's taking 30 minutes. (and as an aside, if this turns
>> >>> into a
>> >>> >> > Dataflow debugging session we should move it off of the Beam list
>> and
>> >>> >> onto
>> >>> >> > a Dataflow-specific tread)
>> >>> >> >
>> >>> >> > Reuven
>> >>> >> >
>> >>> >> > On Tue, Sep 12, 2017 at 3:28 AM, Chaim Turkel <ch...@behalf.com>
>> >>> wrote:
>> >>> >> >
>> >>> >> >> is there a way around this, my time for 13gb is not close to 30
>> >>> >> >> minutes, while it should be around 15 minutes.
>> >>> >> >> Do i need to chunk the code myself to windows, and run in
>> parallel?
>> >>> >> >> chaim
>> >>> >> >>
>> >>> >> >> On Sun, Se

Re: BigQueryIO withSchemaFromView

2017-09-13 Thread Chaim Turkel
I am loading into separate partitions of the same table.
I want to see if streaming will be faster.

Is there a repository where I can use the snapshot version?


On Wed, Sep 13, 2017 at 7:19 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Ah, so you are loading each window into a separate BigQuery table? That
> might be the reason things are slow. Remembert a batch job doesn't return
> until everything finishes, and if you are loading that many tables it's
> entirely possible that BigQuery will throttle you, causing the slowdown.
>
> A couple of options:
>
> 1. Instead of loading into separate BigQuery tables, you could load into
> separate partitions of the same table. See this page for more info:
> https://cloud.google.com/bigquery/docs/partitioned-tables
>
> 2. If you have a streaming unbounded source for your data, you can run
> using a streaming runner. That will load each window as it becomes
> available instead of waiting for everything to load.
>
> Reuven
>
> On Tue, Sep 12, 2017 at 11:48 PM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> from what i found it I have the windowing with bigquery partition (per
>> day - 1545 partitions) the insert can take 5 hours, where if there is
>> no partitions then it takes about 12 minutes
>>
>> I have 13,843,080 recrods 6.76 GB.
>> Any ideas how to get the partition to work faster.
>>
>> Is there a way to get the BigQueryIO to use streaming and not jobs?
>>
>> chaim
>>
>> On Tue, Sep 12, 2017 at 11:32 PM, Chaim Turkel <ch...@behalf.com> wrote:
>> > i am using windowing for the partion of the table, maybe that has to do
>> with it?
>> >
>> > On Tue, Sep 12, 2017 at 11:25 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> >> Ok, something is going wrong then. It appears that your job created over
>> >> 14,000 BigQuery load jobs, which is not expected (and probably why
>> things
>> >> were so slow).
>> >>
>> >> On Tue, Sep 12, 2017 at 8:50 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >>
>> >>> no that specific job created only 2 tables
>> >>>
>> >>> On Tue, Sep 12, 2017 at 4:36 PM, Reuven Lax <re...@google.com.invalid>
>> >>> wrote:
>> >>> > It looks like your job is creating about 14,45 distinct BigQuery
>> tables.
>> >>> > Does that sound correct to you?
>> >>> >
>> >>> > Reuven
>> >>> >
>> >>> > On Tue, Sep 12, 2017 at 6:22 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >>> >
>> >>> >> the job id is 2017-09-12_02_57_55-5233544151932101752
>> >>> >> as you can see the majority of the time is inserting into bigquery.
>> >>> >> is there any way to parallel this?
>> >>> >>
>> >>> >> My feeling for the windowing is that writing should be done per
>> window
>> >>> >> (my window is daily) or at least to be able to configure it
>> >>> >>
>> >>> >> chaim
>> >>> >>
>> >>> >> On Tue, Sep 12, 2017 at 4:10 PM, Reuven Lax
>> <re...@google.com.invalid>
>> >>> >> wrote:
>> >>> >> > So the problem is you are running on Dataflow, and it's taking
>> longer
>> >>> >> than
>> >>> >> > you think it should? If you provide the Dataflow job id we can
>> help
>> >>> you
>> >>> >> > debug why it's taking 30 minutes. (and as an aside, if this turns
>> >>> into a
>> >>> >> > Dataflow debugging session we should move it off of the Beam list
>> and
>> >>> >> onto
>> >>> >> > a Dataflow-specific tread)
>> >>> >> >
>> >>> >> > Reuven
>> >>> >> >
>> >>> >> > On Tue, Sep 12, 2017 at 3:28 AM, Chaim Turkel <ch...@behalf.com>
>> >>> wrote:
>> >>> >> >
>> >>> >> >> is there a way around this, my time for 13gb is not close to 30
>> >>> >> >> minutes, while it should be around 15 minutes.
>> >>> >> >> Do i need to chunk the code myself to windows, and run in
>> parallel?
>> >>> >> >> chaim
>> >>> >> >>
>> >>> >> >> On Sun, Se

Re: BigQueryIO withSchemaFromView

2017-09-13 Thread Chaim Turkel
From what I found, when I have the windowing with BigQuery partitions (per
day - 1545 partitions) the insert can take 5 hours, whereas if there are
no partitions it takes about 12 minutes.

I have 13,843,080 records, 6.76 GB.
Any ideas how to get the partitioned insert to work faster?

Is there a way to get BigQueryIO to use streaming inserts and not load jobs?

chaim
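
For the streaming question, a minimal sketch, not from the original message: newer Beam releases expose the write method on BigQueryIO, so the write from this thread can be switched to streaming inserts (whether withMethod is available depends on your SDK version):

// Sketch only: the same write as in this thread, but forcing streaming
// inserts instead of batch load jobs.
docsRows.apply("insert data table - " + table.getTableName(),
        BigQueryIO.writeTableRows()
                .to(TableRefPartition.perDay(options.getBQProject(),
                        options.getDatasetId(), table.getBqTableName()))
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .withSchemaFromView(tableSchemas)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));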

On Tue, Sep 12, 2017 at 11:32 PM, Chaim Turkel <ch...@behalf.com> wrote:
> i am using windowing for the partion of the table, maybe that has to do with 
> it?
>
> On Tue, Sep 12, 2017 at 11:25 PM, Reuven Lax <re...@google.com.invalid> wrote:
>> Ok, something is going wrong then. It appears that your job created over
>> 14,000 BigQuery load jobs, which is not expected (and probably why things
>> were so slow).
>>
>> On Tue, Sep 12, 2017 at 8:50 AM, Chaim Turkel <ch...@behalf.com> wrote:
>>
>>> no that specific job created only 2 tables
>>>
>>> On Tue, Sep 12, 2017 at 4:36 PM, Reuven Lax <re...@google.com.invalid>
>>> wrote:
>>> > It looks like your job is creating about 14,45 distinct BigQuery tables.
>>> > Does that sound correct to you?
>>> >
>>> > Reuven
>>> >
>>> > On Tue, Sep 12, 2017 at 6:22 AM, Chaim Turkel <ch...@behalf.com> wrote:
>>> >
>>> >> the job id is 2017-09-12_02_57_55-5233544151932101752
>>> >> as you can see the majority of the time is inserting into bigquery.
>>> >> is there any way to parallel this?
>>> >>
>>> >> My feeling for the windowing is that writing should be done per window
>>> >> (my window is daily) or at least to be able to configure it
>>> >>
>>> >> chaim
>>> >>
>>> >> On Tue, Sep 12, 2017 at 4:10 PM, Reuven Lax <re...@google.com.invalid>
>>> >> wrote:
>>> >> > So the problem is you are running on Dataflow, and it's taking longer
>>> >> than
>>> >> > you think it should? If you provide the Dataflow job id we can help
>>> you
>>> >> > debug why it's taking 30 minutes. (and as an aside, if this turns
>>> into a
>>> >> > Dataflow debugging session we should move it off of the Beam list and
>>> >> onto
>>> >> > a Dataflow-specific tread)
>>> >> >
>>> >> > Reuven
>>> >> >
>>> >> > On Tue, Sep 12, 2017 at 3:28 AM, Chaim Turkel <ch...@behalf.com>
>>> wrote:
>>> >> >
>>> >> >> is there a way around this, my time for 13gb is not close to 30
>>> >> >> minutes, while it should be around 15 minutes.
>>> >> >> Do i need to chunk the code myself to windows, and run in parallel?
>>> >> >> chaim
>>> >> >>
>>> >> >> On Sun, Sep 10, 2017 at 6:32 PM, Reuven Lax <re...@google.com.invalid
>>> >
>>> >> >> wrote:
>>> >> >> > In that case I can say unequivocally that Dataflow (in batch mode)
>>> >> does
>>> >> >> not
>>> >> >> > produce results for a stage until it has processed that entire
>>> stage.
>>> >> The
>>> >> >> > reason for this is that the batch runner is optimized for
>>> throughput,
>>> >> not
>>> >> >> > latency; it wants to minimize the time for the entire job to
>>> finish,
>>> >> not
>>> >> >> > the time till first output. The side input will not be materialized
>>> >> until
>>> >> >> > all of the data for all of the windows of the side input have been
>>> >> >> > processed. The streaming runner on the other hand will produce
>>> >> windows as
>>> >> >> > they finish. So for the batch runner, there is no performance
>>> >> advantage
>>> >> >> you
>>> >> >> > get for windowing the side input.
>>> >> >> >
>>> >> >> > The fact that BigQueryIO needs the schema side input to be globally
>>> >> >> > windowed is a bit confusing and not well documented. We should add
>>> >> better
>>> >> >> > javadoc explaining this.
>>> >> >> >
>>> >> >> > Reuven
>>> >> >> >
>>> >> >> > On Sun, Sep 10, 2017 at 12

Re: BigQueryIO withSchemaFromView

2017-09-12 Thread Chaim Turkel
I am using windowing for the partitioning of the table; maybe that has to do with it?

On Tue, Sep 12, 2017 at 11:25 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Ok, something is going wrong then. It appears that your job created over
> 14,000 BigQuery load jobs, which is not expected (and probably why things
> were so slow).
>
> On Tue, Sep 12, 2017 at 8:50 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> no that specific job created only 2 tables
>>
>> On Tue, Sep 12, 2017 at 4:36 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > It looks like your job is creating about 14,45 distinct BigQuery tables.
>> > Does that sound correct to you?
>> >
>> > Reuven
>> >
>> > On Tue, Sep 12, 2017 at 6:22 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> the job id is 2017-09-12_02_57_55-5233544151932101752
>> >> as you can see the majority of the time is inserting into bigquery.
>> >> is there any way to parallel this?
>> >>
>> >> My feeling for the windowing is that writing should be done per window
>> >> (my window is daily) or at least to be able to configure it
>> >>
>> >> chaim
>> >>
>> >> On Tue, Sep 12, 2017 at 4:10 PM, Reuven Lax <re...@google.com.invalid>
>> >> wrote:
>> >> > So the problem is you are running on Dataflow, and it's taking longer
>> >> than
>> >> > you think it should? If you provide the Dataflow job id we can help
>> you
>> >> > debug why it's taking 30 minutes. (and as an aside, if this turns
>> into a
>> >> > Dataflow debugging session we should move it off of the Beam list and
>> >> onto
>> >> > a Dataflow-specific tread)
>> >> >
>> >> > Reuven
>> >> >
>> >> > On Tue, Sep 12, 2017 at 3:28 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> >> is there a way around this, my time for 13gb is not close to 30
>> >> >> minutes, while it should be around 15 minutes.
>> >> >> Do i need to chunk the code myself to windows, and run in parallel?
>> >> >> chaim
>> >> >>
>> >> >> On Sun, Sep 10, 2017 at 6:32 PM, Reuven Lax <re...@google.com.invalid
>> >
>> >> >> wrote:
>> >> >> > In that case I can say unequivocally that Dataflow (in batch mode)
>> >> does
>> >> >> not
>> >> >> > produce results for a stage until it has processed that entire
>> stage.
>> >> The
>> >> >> > reason for this is that the batch runner is optimized for
>> throughput,
>> >> not
>> >> >> > latency; it wants to minimize the time for the entire job to
>> finish,
>> >> not
>> >> >> > the time till first output. The side input will not be materialized
>> >> until
>> >> >> > all of the data for all of the windows of the side input have been
>> >> >> > processed. The streaming runner on the other hand will produce
>> >> windows as
>> >> >> > they finish. So for the batch runner, there is no performance
>> >> advantage
>> >> >> you
>> >> >> > get for windowing the side input.
>> >> >> >
>> >> >> > The fact that BigQueryIO needs the schema side input to be globally
>> >> >> > windowed is a bit confusing and not well documented. We should add
>> >> better
>> >> >> > javadoc explaining this.
>> >> >> >
>> >> >> > Reuven
>> >> >> >
>> >> >> > On Sun, Sep 10, 2017 at 12:50 AM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> >> batch on dataflow
>> >> >> >>
>> >> >> >> On Sun, Sep 10, 2017 at 8:05 AM, Reuven Lax
>> <re...@google.com.invalid
>> >> >
>> >> >> >> wrote:
>> >> >> >> > Which runner are you using? And is this a batch pipeline?
>> >> >> >> >
>> >> >> >> > On Sat, Sep 9, 2017 at 10:03 PM, Chaim Turkel <ch...@behalf.com
>> >
>> >> >> wrote:
>> >> >> >> >
>> >> >> >> >&

Re: BigQueryIO withSchemaFromView

2017-09-12 Thread Chaim Turkel
Any idea how I can debug it or find the issue?

On Tue, Sep 12, 2017 at 11:25 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Ok, something is going wrong then. It appears that your job created over
> 14,000 BigQuery load jobs, which is not expected (and probably why things
> were so slow).
>
> On Tue, Sep 12, 2017 at 8:50 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> no that specific job created only 2 tables
>>
>> On Tue, Sep 12, 2017 at 4:36 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > It looks like your job is creating about 14,45 distinct BigQuery tables.
>> > Does that sound correct to you?
>> >
>> > Reuven
>> >
>> > On Tue, Sep 12, 2017 at 6:22 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> the job id is 2017-09-12_02_57_55-5233544151932101752
>> >> as you can see the majority of the time is inserting into bigquery.
>> >> is there any way to parallel this?
>> >>
>> >> My feeling for the windowing is that writing should be done per window
>> >> (my window is daily) or at least to be able to configure it
>> >>
>> >> chaim
>> >>
>> >> On Tue, Sep 12, 2017 at 4:10 PM, Reuven Lax <re...@google.com.invalid>
>> >> wrote:
>> >> > So the problem is you are running on Dataflow, and it's taking longer
>> >> than
>> >> > you think it should? If you provide the Dataflow job id we can help
>> you
>> >> > debug why it's taking 30 minutes. (and as an aside, if this turns
>> into a
>> >> > Dataflow debugging session we should move it off of the Beam list and
>> >> onto
>> >> > a Dataflow-specific tread)
>> >> >
>> >> > Reuven
>> >> >
>> >> > On Tue, Sep 12, 2017 at 3:28 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> >> is there a way around this, my time for 13gb is not close to 30
>> >> >> minutes, while it should be around 15 minutes.
>> >> >> Do i need to chunk the code myself to windows, and run in parallel?
>> >> >> chaim
>> >> >>
>> >> >> On Sun, Sep 10, 2017 at 6:32 PM, Reuven Lax <re...@google.com.invalid
>> >
>> >> >> wrote:
>> >> >> > In that case I can say unequivocally that Dataflow (in batch mode)
>> >> does
>> >> >> not
>> >> >> > produce results for a stage until it has processed that entire
>> stage.
>> >> The
>> >> >> > reason for this is that the batch runner is optimized for
>> throughput,
>> >> not
>> >> >> > latency; it wants to minimize the time for the entire job to
>> finish,
>> >> not
>> >> >> > the time till first output. The side input will not be materialized
>> >> until
>> >> >> > all of the data for all of the windows of the side input have been
>> >> >> > processed. The streaming runner on the other hand will produce
>> >> windows as
>> >> >> > they finish. So for the batch runner, there is no performance
>> >> advantage
>> >> >> you
>> >> >> > get for windowing the side input.
>> >> >> >
>> >> >> > The fact that BigQueryIO needs the schema side input to be globally
>> >> >> > windowed is a bit confusing and not well documented. We should add
>> >> better
>> >> >> > javadoc explaining this.
>> >> >> >
>> >> >> > Reuven
>> >> >> >
>> >> >> > On Sun, Sep 10, 2017 at 12:50 AM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> >> batch on dataflow
>> >> >> >>
>> >> >> >> On Sun, Sep 10, 2017 at 8:05 AM, Reuven Lax
>> <re...@google.com.invalid
>> >> >
>> >> >> >> wrote:
>> >> >> >> > Which runner are you using? And is this a batch pipeline?
>> >> >> >> >
>> >> >> >> > On Sat, Sep 9, 2017 at 10:03 PM, Chaim Turkel <ch...@behalf.com
>> >
>> >> >> wrote:
>> >> >> >> >
>> >> >> >> >> Thank for the answer, 

Re: BigQueryIO withSchemaFromView

2017-09-12 Thread Chaim Turkel
No, that specific job created only 2 tables.

On Tue, Sep 12, 2017 at 4:36 PM, Reuven Lax <re...@google.com.invalid> wrote:
> It looks like your job is creating about 14,45 distinct BigQuery tables.
> Does that sound correct to you?
>
> Reuven
>
> On Tue, Sep 12, 2017 at 6:22 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> the job id is 2017-09-12_02_57_55-5233544151932101752
>> as you can see the majority of the time is inserting into bigquery.
>> is there any way to parallel this?
>>
>> My feeling for the windowing is that writing should be done per window
>> (my window is daily) or at least to be able to configure it
>>
>> chaim
>>
>> On Tue, Sep 12, 2017 at 4:10 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > So the problem is you are running on Dataflow, and it's taking longer
>> than
>> > you think it should? If you provide the Dataflow job id we can help you
>> > debug why it's taking 30 minutes. (and as an aside, if this turns into a
>> > Dataflow debugging session we should move it off of the Beam list and
>> onto
>> > a Dataflow-specific tread)
>> >
>> > Reuven
>> >
>> > On Tue, Sep 12, 2017 at 3:28 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> is there a way around this, my time for 13gb is not close to 30
>> >> minutes, while it should be around 15 minutes.
>> >> Do i need to chunk the code myself to windows, and run in parallel?
>> >> chaim
>> >>
>> >> On Sun, Sep 10, 2017 at 6:32 PM, Reuven Lax <re...@google.com.invalid>
>> >> wrote:
>> >> > In that case I can say unequivocally that Dataflow (in batch mode)
>> does
>> >> not
>> >> > produce results for a stage until it has processed that entire stage.
>> The
>> >> > reason for this is that the batch runner is optimized for throughput,
>> not
>> >> > latency; it wants to minimize the time for the entire job to finish,
>> not
>> >> > the time till first output. The side input will not be materialized
>> until
>> >> > all of the data for all of the windows of the side input have been
>> >> > processed. The streaming runner on the other hand will produce
>> windows as
>> >> > they finish. So for the batch runner, there is no performance
>> advantage
>> >> you
>> >> > get for windowing the side input.
>> >> >
>> >> > The fact that BigQueryIO needs the schema side input to be globally
>> >> > windowed is a bit confusing and not well documented. We should add
>> better
>> >> > javadoc explaining this.
>> >> >
>> >> > Reuven
>> >> >
>> >> > On Sun, Sep 10, 2017 at 12:50 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> >> batch on dataflow
>> >> >>
>> >> >> On Sun, Sep 10, 2017 at 8:05 AM, Reuven Lax <re...@google.com.invalid
>> >
>> >> >> wrote:
>> >> >> > Which runner are you using? And is this a batch pipeline?
>> >> >> >
>> >> >> > On Sat, Sep 9, 2017 at 10:03 PM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> >> Thank for the answer, but i don't think that that is the case.
>> From
>> >> >> >> what i have seen, since i have other code to update status based
>> on
>> >> >> >> the window, it does get called before all the windows are
>> calculated.
>> >> >> >> There is no logical reason to wait, once the window has finished,
>> the
>> >> >> >> rest of the pipeline should run and the BigQuery should start to
>> >> write
>> >> >> >> the results.
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Sat, Sep 9, 2017 at 10:48 PM, Reuven Lax
>> <re...@google.com.invalid
>> >> >
>> >> >> >> wrote:
>> >> >> >> > Logically the BigQuery write does not depend on windows, and
>> >> writing
>> >> >> it
>> >> >> >> > windowed would result in incorrect output. For this reason,
>> >> BigQueryIO
>> >> >> >> > rewindows int global windows before actually wri

Re: BigQueryIO withSchemaFromView

2017-09-12 Thread Chaim Turkel
The job id is 2017-09-12_02_57_55-5233544151932101752.
As you can see, the majority of the time is spent inserting into BigQuery.
Is there any way to parallelize this?

My feeling for the windowing is that writing should be done per window
(my window is daily), or at least that this should be configurable.

chaim

On Tue, Sep 12, 2017 at 4:10 PM, Reuven Lax <re...@google.com.invalid> wrote:
> So the problem is you are running on Dataflow, and it's taking longer than
> you think it should? If you provide the Dataflow job id we can help you
> debug why it's taking 30 minutes. (and as an aside, if this turns into a
> Dataflow debugging session we should move it off of the Beam list and onto
> a Dataflow-specific tread)
>
> Reuven
>
> On Tue, Sep 12, 2017 at 3:28 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> is there a way around this, my time for 13gb is not close to 30
>> minutes, while it should be around 15 minutes.
>> Do i need to chunk the code myself to windows, and run in parallel?
>> chaim
>>
>> On Sun, Sep 10, 2017 at 6:32 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > In that case I can say unequivocally that Dataflow (in batch mode) does
>> not
>> > produce results for a stage until it has processed that entire stage. The
>> > reason for this is that the batch runner is optimized for throughput, not
>> > latency; it wants to minimize the time for the entire job to finish, not
>> > the time till first output. The side input will not be materialized until
>> > all of the data for all of the windows of the side input have been
>> > processed. The streaming runner on the other hand will produce windows as
>> > they finish. So for the batch runner, there is no performance advantage
>> you
>> > get for windowing the side input.
>> >
>> > The fact that BigQueryIO needs the schema side input to be globally
>> > windowed is a bit confusing and not well documented. We should add better
>> > javadoc explaining this.
>> >
>> > Reuven
>> >
>> > On Sun, Sep 10, 2017 at 12:50 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> batch on dataflow
>> >>
>> >> On Sun, Sep 10, 2017 at 8:05 AM, Reuven Lax <re...@google.com.invalid>
>> >> wrote:
>> >> > Which runner are you using? And is this a batch pipeline?
>> >> >
>> >> > On Sat, Sep 9, 2017 at 10:03 PM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> >> Thank for the answer, but i don't think that that is the case. From
>> >> >> what i have seen, since i have other code to update status based on
>> >> >> the window, it does get called before all the windows are calculated.
>> >> >> There is no logical reason to wait, once the window has finished, the
>> >> >> rest of the pipeline should run and the BigQuery should start to
>> write
>> >> >> the results.
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Sat, Sep 9, 2017 at 10:48 PM, Reuven Lax <re...@google.com.invalid
>> >
>> >> >> wrote:
>> >> >> > Logically the BigQuery write does not depend on windows, and
>> writing
>> >> it
>> >> >> > windowed would result in incorrect output. For this reason,
>> BigQueryIO
>> >> >> > rewindows int global windows before actually writing to BigQuery.
>> >> >> >
>> >> >> > If you are running in batch mode, there is no performance
>> difference
>> >> >> > between windowed and unwindowed side inputs. I believe that all of
>> the
>> >> >> > batch runners wait until all windows are calculated before
>> >> materializing
>> >> >> > the output.
>> >> >> >
>> >> >> > Reuven
>> >> >> >
>> >> >> > On Sat, Sep 9, 2017 at 12:43 PM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> >> the schema depends on the data per window.
>> >> >> >> when i added the global window it works, but then i loose the
>> >> >> >> performance, since the secound stage of writing will begin only
>> after
>> >> >> >> the side input has read all the data and updated the schema
>> >> >> >> The batchmode of the BigqueryIO seems to use a global 

Re: BigQueryIO withSchemaFromView

2017-09-12 Thread Chaim Turkel
Is there a way around this? My time for 13 GB is now close to 30
minutes, while it should be around 15 minutes.
Do I need to chunk the data into windows myself and run them in parallel?
chaim

On Sun, Sep 10, 2017 at 6:32 PM, Reuven Lax <re...@google.com.invalid> wrote:
> In that case I can say unequivocally that Dataflow (in batch mode) does not
> produce results for a stage until it has processed that entire stage. The
> reason for this is that the batch runner is optimized for throughput, not
> latency; it wants to minimize the time for the entire job to finish, not
> the time till first output. The side input will not be materialized until
> all of the data for all of the windows of the side input have been
> processed. The streaming runner on the other hand will produce windows as
> they finish. So for the batch runner, there is no performance advantage you
> get for windowing the side input.
>
> The fact that BigQueryIO needs the schema side input to be globally
> windowed is a bit confusing and not well documented. We should add better
> javadoc explaining this.
>
> Reuven
>
> On Sun, Sep 10, 2017 at 12:50 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> batch on dataflow
>>
>> On Sun, Sep 10, 2017 at 8:05 AM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > Which runner are you using? And is this a batch pipeline?
>> >
>> > On Sat, Sep 9, 2017 at 10:03 PM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> Thank for the answer, but i don't think that that is the case. From
>> >> what i have seen, since i have other code to update status based on
>> >> the window, it does get called before all the windows are calculated.
>> >> There is no logical reason to wait, once the window has finished, the
>> >> rest of the pipeline should run and the BigQuery should start to write
>> >> the results.
>> >>
>> >>
>> >>
>> >> On Sat, Sep 9, 2017 at 10:48 PM, Reuven Lax <re...@google.com.invalid>
>> >> wrote:
>> >> > Logically the BigQuery write does not depend on windows, and writing
>> it
>> >> > windowed would result in incorrect output. For this reason, BigQueryIO
>> >> > rewindows int global windows before actually writing to BigQuery.
>> >> >
>> >> > If you are running in batch mode, there is no performance difference
>> >> > between windowed and unwindowed side inputs. I believe that all of the
>> >> > batch runners wait until all windows are calculated before
>> materializing
>> >> > the output.
>> >> >
>> >> > Reuven
>> >> >
>> >> > On Sat, Sep 9, 2017 at 12:43 PM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> >> the schema depends on the data per window.
>> >> >> when i added the global window it works, but then i loose the
>> >> >> performance, since the secound stage of writing will begin only after
>> >> >> the side input has read all the data and updated the schema
>> >> >> The batchmode of the BigqueryIO seems to use a global window that i
>> >> >> don't know why?
>> >> >>
>> >> >> chaim
>> >> >>
>> >> >> On Fri, Sep 8, 2017 at 9:26 PM, Eugene Kirpichov
>> >> >> <kirpic...@google.com.invalid> wrote:
>> >> >> > Are your schemas actually supposed to be different between
>> different
>> >> >> > windows, or do they depend only on data?
>> >> >> > I see you have a commented-out Window.into(new GlobalWindows()) for
>> >> your
>> >> >> > side input - did that work when it wasn't commented out?
>> >> >> >
>> >> >> > On Fri, Sep 8, 2017 at 2:17 AM Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >> >
>> >> >> >> my code is:
>> >> >> >>
>> >> >> >> //read docs from mongo
>> >> >> >> final PCollection docs = pipeline
>> >> >> >> .apply(table.getTableName(),
>> >> >> MongoDbIO.read()
>> >> >> >> .withUri("mongodb://" +
>> >> >> >> connectionParams)
>> >> >> >> .withFilter(filter)
>&g

Re: PipelineResult

2017-09-11 Thread Chaim Turkel
It would be much smarter if BigQueryIO returned a collection of Void.


chaim

On Mon, Sep 11, 2017 at 8:47 AM, Reuven Lax <re...@google.com.invalid> wrote:
> On Sun, Sep 10, 2017 at 10:12 PM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> I am migrating multiple tables from mongo to bigquery. So i loop over
>> each table and create a PCollection for each table.
>> I would like to update a status row for each run (time, records...).
>> So I want to write the results at the end.
>> I would like to write a ParDo method after the sink but currently
>> bigqueryIO does not support this.
>>
>
> Makes sense. I wonder if BigQueryIO.Write should return a PCollection of
> metadata objects (table name, rows written, etc.) as part of the
> WriteResult.
>
>
>>
>> Using metrics is an option but then the client needs to block and wait
>> for the pipline to finish - and it might loose the information
>>
>
> If using the Dataflow runner, you can always use the job id to get the
> metrics even if the main program crashes.
>
> Chaim
>>
>> On Sun, Sep 10, 2017 at 8:05 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > Can you explain what you mean by multiple collections running on the
>> > pipeline? What do you need the results for?
>> >
>> > On Sat, Sep 9, 2017 at 10:45 PM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> Hi,
>> >>   I am having trouble figuring out what to do with the results. I have
>> >> multiple collections running on the pipeline, and since the sink does
>> >> not give me the option to get the result, i need to wait for the
>> >> pipeline to finish and then poll the results.
>> >> From what i can see my only option is to use the metrics, is there
>> >> another way to pass information from the collections to the results?
>> >>
>> >> chaim
>> >>
>>


Re: PipelineResult

2017-09-10 Thread Chaim Turkel
I am migrating multiple tables from MongoDB to BigQuery. So I loop over
each table and create a PCollection for each table.
I would like to update a status row for each run (time, records...).
So I want to write the results at the end.
I would like to apply a ParDo after the sink, but currently
BigQueryIO does not support this.

Using metrics is an option, but then the client needs to block and wait
for the pipeline to finish - and it might lose the information.

Chaim
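
For the metrics route mentioned above, a minimal sketch, not from the thread, of counting written rows per table with a Beam Counter and reading the values back once waitUntilFinish() returns; the "migration" namespace, the counter names, and the MigrationStatus helper are illustrative only:

// Sketch only: count rows per table with a Metrics counter and read the
// values back once the pipeline has finished.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.transforms.DoFn;

/** Counts the rows flowing into the BigQuery write for one table. */
class CountRowsFn extends DoFn<TableRow, TableRow> {
    private final Counter rowsCounter;

    CountRowsFn(String tableName) {
        this.rowsCounter = Metrics.counter("migration", "rows_" + tableName);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        rowsCounter.inc();
        c.output(c.element());
    }
}

/** After pipeline.run(): block until done, then read the counters back. */
class MigrationStatus {
    static MetricQueryResults queryMigrationCounters(PipelineResult result) {
        result.waitUntilFinish();
        // Iterate the returned counters() and persist each counter's name and
        // committed value into the status table.
        return result.metrics().queryMetrics(
                MetricsFilter.builder()
                        .addNameFilter(MetricNameFilter.inNamespace("migration"))
                        .build());
    }
}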

On Sun, Sep 10, 2017 at 8:05 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Can you explain what you mean by multiple collections running on the
> pipeline? What do you need the results for?
>
> On Sat, Sep 9, 2017 at 10:45 PM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> Hi,
>>   I am having trouble figuring out what to do with the results. I have
>> multiple collections running on the pipeline, and since the sink does
>> not give me the option to get the result, i need to wait for the
>> pipeline to finish and then poll the results.
>> From what i can see my only option is to use the metrics, is there
>> another way to pass information from the collections to the results?
>>
>> chaim
>>


Re: BigQueryIO withSchemaFromView

2017-09-10 Thread Chaim Turkel
Batch, on Dataflow.

On Sun, Sep 10, 2017 at 8:05 AM, Reuven Lax <re...@google.com.invalid> wrote:
> Which runner are you using? And is this a batch pipeline?
>
> On Sat, Sep 9, 2017 at 10:03 PM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> Thank for the answer, but i don't think that that is the case. From
>> what i have seen, since i have other code to update status based on
>> the window, it does get called before all the windows are calculated.
>> There is no logical reason to wait, once the window has finished, the
>> rest of the pipeline should run and the BigQuery should start to write
>> the results.
>>
>>
>>
>> On Sat, Sep 9, 2017 at 10:48 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > Logically the BigQuery write does not depend on windows, and writing it
>> > windowed would result in incorrect output. For this reason, BigQueryIO
>> > rewindows int global windows before actually writing to BigQuery.
>> >
>> > If you are running in batch mode, there is no performance difference
>> > between windowed and unwindowed side inputs. I believe that all of the
>> > batch runners wait until all windows are calculated before materializing
>> > the output.
>> >
>> > Reuven
>> >
>> > On Sat, Sep 9, 2017 at 12:43 PM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> the schema depends on the data per window.
>> >> when i added the global window it works, but then i loose the
>> >> performance, since the secound stage of writing will begin only after
>> >> the side input has read all the data and updated the schema
>> >> The batchmode of the BigqueryIO seems to use a global window that i
>> >> don't know why?
>> >>
>> >> chaim
>> >>
>> >> On Fri, Sep 8, 2017 at 9:26 PM, Eugene Kirpichov
>> >> <kirpic...@google.com.invalid> wrote:
>> >> > Are your schemas actually supposed to be different between different
>> >> > windows, or do they depend only on data?
>> >> > I see you have a commented-out Window.into(new GlobalWindows()) for
>> your
>> >> > side input - did that work when it wasn't commented out?
>> >> >
>> >> > On Fri, Sep 8, 2017 at 2:17 AM Chaim Turkel <ch...@behalf.com> wrote:
>> >> >
>> >> >> my code is:
>> >> >>
>> >> >> //read docs from mongo
>> >> >> final PCollection docs = pipeline
>> >> >> .apply(table.getTableName(),
>> >> MongoDbIO.read()
>> >> >> .withUri("mongodb://" +
>> >> >> connectionParams)
>> >> >> .withFilter(filter)
>> >> >> .withDatabase(options.
>> getDBName())
>> >> >> .withCollection(table.
>> >> getTableName()))
>> >> >> .apply("AddEventTimestamps",
>> >> >> WithTimestamps.of((Document doc) -> new
>> >> >> Instant(MongodbManagment.docTimeToLong(doc
>> >> >> .apply("Window Daily",
>> >> >> Window.into(CalendarWindows.days(1)));
>> >> >>
>> >> >> //update bq schema based on window
>> >> >> final PCollectionView<Map<String, String>>
>> >> >> tableSchemas = docs
>> >> >> //.apply("Global Window",Window.into(new
>> >> >> GlobalWindows()))
>> >> >> .apply("extract schema " +
>> >> >> table.getTableName(), new
>> >> >> LoadMongodbSchemaPipeline.DocsToSchemaTransform(table))
>> >> >> .apply("getTableSchemaMemory " +
>> >> >> table.getTableName(),
>> >> >> ParDo.of(getTableSchemaMemory(table.getTableName(
>> >> >> .apply(View.asMap());
>> >> >>
>> >> >> final PCollection docsRows = docs
>> >> >> .apply("doc to row " +
>> >> >> table.getTableNam

PipelineResult

2017-09-09 Thread Chaim Turkel
Hi,
  I am having trouble figuring out what to do with the results. I have
multiple collections running on the pipeline, and since the sink does
not give me the option to get the result, I need to wait for the
pipeline to finish and then poll the results.
From what I can see my only option is to use the metrics; is there
another way to pass information from the collections to the results?

chaim


Re: writing status

2017-09-09 Thread Chaim Turkel
So what you are saying is that windowing is not supported on the
BigQuery write? That does not make sense, since I am using it for the table
partition and that works fine.

On Sat, Sep 9, 2017 at 7:40 PM, Steve Niemitz <sniem...@apache.org> wrote:
> I wonder if it makes sense to start simple and go from there.  For example,
> I enhanced BigtableIO.Write to output the number of rows written
> in finishBundle(), simply into the global window with the current
> timestamp.  This was more than enough to unblock me, but doesn't support
> more complicated scenarios with windowing.
>
> However, as I said it was more than enough to solve the general batch use
> case, and I imagine could be enhanced to support windowing by keeping track
> of which windows were written per bundle. (can there even ever be more than
> one window per bundle?)
>
> On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
>> Hi,
>> I was going to implement this, but discussed it with +Reuven Lax
>> <re...@google.com> and it appears to be quite difficult to do properly, or
>> even to define what it means at all, especially if you're using the
>> streaming inserts write method. So for now there is no workaround except
>> programmatically waiting for your whole pipeline to finish
>> (pipeline.run().waitUntilFinish()).
>>
>> On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <ch...@behalf.com> wrote:
>>
>> > is there a way around this for now?
>> > how can i get a snapshot version?
>> >
>> > chaim
>> >
>> > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
>> > <kirpic...@google.com.invalid> wrote:
>> > > Oh I see! Okay, this should be easy to fix. I'll take a look.
>> > >
>> > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
>> > >
>> > >> WriteResult does not support apply -> that is the problem
>> > >>
>> > >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
>> > >> <kirpic...@google.com.invalid> wrote:
>> > >> > Hi,
>> > >> >
>> > >> > Sorry for the delay. So sounds like you want to do something after
>> > >> writing
>> > >> > a window of data to BigQuery is complete.
>> > >> > I think this should be possible: expansion of BigQueryIO.write()
>> > returns
>> > >> a
>> > >> > WriteResult and you can apply other transforms to it. Have you tried
>> > >> that?
>> > >> >
>> > >> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com>
>> > wrote:
>> > >> >
>> > >> >> I have documents from a mongo db that i need to migrate to
>> bigquery.
>> > >> >> Since it is mongodb i do not know they schema ahead of time, so i
>> > have
>> > >> >> two pipelines, one to run over the documents and update the
>> bigquery
>> > >> >> schema, then wait a few minutes (i can take for bigquery to be able
>> > to
>> > >> >> use the new schema) then with the other pipline copy all the
>> > >> >> documents.
>> > >> >> To know as to where i got with the different piplines i have a
>> status
>> > >> >> table so that at the start i know from where to continue.
>> > >> >> So i need the option to update the status table with the success of
>> > >> >> the copy and some time value of the last copied document
>> > >> >>
>> > >> >>
>> > >> >> chaim
>> > >> >>
>> > >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> > >> >> <kirpic...@google.com.invalid> wrote:
>> > >> >> > I'd like to know more about your both use cases, can you
>> clarify? I
>> > >> think
>> > >> >> > making sinks output something that can be waited on by another
>> > >> pipeline
>> > >> >> > step is a reasonable request, but more details would help refine
>> > this
>> > >> >> > suggestion.
>> > >> >> >
>> > >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
>> > >> chamik...@apache.org>
>> > >> >> > wrote:
>> > >> >> >
>> > >> >> >> Can you do this 

Re: writing status

2017-09-09 Thread Chaim Turkel
So how does the getFailedInserts method work (though from what I saw,
it does not work)?

chaim
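
For context, a minimal sketch, not from the thread, of how getFailedInserts is meant to be consumed; it applies to the streaming-insert path, and the table spec, schema view, and LogFailedRowFn handler are placeholders:

// Sketch only: rows rejected by BigQuery streaming inserts come back as a
// PCollection of TableRow; LogFailedRowFn and the table spec are placeholders.
WriteResult writeResult = docsRows.apply("stream into BigQuery",
        BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchemaFromView(tableSchemas)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

writeResult.getFailedInserts()
        .apply("handle failed rows", ParDo.of(new LogFailedRowFn()));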

On Sat, Sep 9, 2017 at 9:49 PM, Reuven Lax <re...@google.com.invalid> wrote:
> I'm still not sure how this would work (or even make sense) for the
> streaming-write path.
>
> Also in both paths, the actual write to BigQuery is unwindowed.
>
> On Sat, Sep 9, 2017 at 11:44 AM, Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> There'd be 1 Void per pane per window, so I could extract information
>> about whether this is the first pane, last pane, or something else - there
>> are probably use cases for each of these.
>>
>> On Sat, Sep 9, 2017 at 11:37 AM Reuven Lax <re...@google.com> wrote:
>>
>>> How would you know how many Voids to wait for downstream?
>>>
>>> On Sat, Sep 9, 2017 at 10:46 AM, Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Hi Steve,
>>>> Unfortunately for BigQuery it's more complicated than that. Rows aren't
>>>> written to BigQuery one by one (unless you're using streaming inserts,
>>>> which are way more expensive and are usually used only in streaming
>>>> pipelines) - they are written to files, and then a BigQuery import job, or
>>>> several import jobs if there are too many files, picks them up. We can
>>>> declare writing complete when all of the BigQuery import jobs have
>>>> successfully completed.
>>>> However, the method of writing is an implementation detail of BigQuery,
>>>> so we need to create an API that works regardless of the method (import
>>>> jobs vs. streaming inserts).
>>>> Another complication is triggering - windows can fire multiple times.
>>>> This rules out any approaches that sequence using side inputs, because side
>>>> inputs don't have triggering.
>>>>
>>>> I think a common approach could be to return a PCollection,
>>>> containing a Void in every window and pane that has been successfully
>>>> written. This could be implemented in both modes and could be a general
>>>> design patterns for this sort of thing. It just isn't easy to implement, so
>>>> I didn't have time to take it on. It also could turn out to have other
>>>> complications we haven't thought of yet.
>>>>
>>>> That said, if somebody tried to implement this for some connectors (not
>>>> necessarily BigQuery) and pioneered the approach, it would be a great
>>>> contribution.
>>>>
>>>> On Sat, Sep 9, 2017 at 9:41 AM Steve Niemitz <sniem...@apache.org>
>>>> wrote:
>>>>
>>>>> I wonder if it makes sense to start simple and go from there.  For
>>>>> example,
>>>>> I enhanced BigtableIO.Write to output the number of rows written
>>>>> in finishBundle(), simply into the global window with the current
>>>>> timestamp.  This was more than enough to unblock me, but doesn't support
>>>>> more complicated scenarios with windowing.
>>>>>
>>>>> However, as I said it was more than enough to solve the general batch
>>>>> use
>>>>> case, and I imagine could be enhanced to support windowing by keeping
>>>>> track
>>>>> of which windows were written per bundle. (can there even ever be more
>>>>> than
>>>>> one window per bundle?)
>>>>>
>>>>> On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <
>>>>> kirpic...@google.com.invalid> wrote:
>>>>>
>>>>> > Hi,
>>>>> > I was going to implement this, but discussed it with +Reuven Lax
>>>>> > <re...@google.com> and it appears to be quite difficult to do
>>>>> properly, or
>>>>> > even to define what it means at all, especially if you're using the
>>>>> > streaming inserts write method. So for now there is no workaround
>>>>> except
>>>>> > programmatically waiting for your whole pipeline to finish
>>>>> > (pipeline.run().waitUntilFinish()).
>>>>> >
>>>>> > On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <ch...@behalf.com> wrote:
>>>>> >
>>>>> > > is there a way around this for now?
>>>>> > > how can i get a snapshot version?
>>>>> > >
>>>>> > > chaim
>>>>> > >
>>>>> > > On Tue

Re: BigQueryIO withSchemaFromView

2017-09-09 Thread Chaim Turkel
Thanks for the answer, but I don't think that is the case. From
what I have seen, since I have other code to update status based on
the window, it does get called before all the windows are calculated.
There is no logical reason to wait: once a window has finished, the
rest of the pipeline should run and BigQuery should start to write
the results.



On Sat, Sep 9, 2017 at 10:48 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Logically the BigQuery write does not depend on windows, and writing it
> windowed would result in incorrect output. For this reason, BigQueryIO
> rewindows int global windows before actually writing to BigQuery.
>
> If you are running in batch mode, there is no performance difference
> between windowed and unwindowed side inputs. I believe that all of the
> batch runners wait until all windows are calculated before materializing
> the output.
>
> Reuven
>
> On Sat, Sep 9, 2017 at 12:43 PM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> the schema depends on the data per window.
>> when i added the global window it works, but then i loose the
>> performance, since the secound stage of writing will begin only after
>> the side input has read all the data and updated the schema
>> The batchmode of the BigqueryIO seems to use a global window that i
>> don't know why?
>>
>> chaim
>>
>> On Fri, Sep 8, 2017 at 9:26 PM, Eugene Kirpichov
>> <kirpic...@google.com.invalid> wrote:
>> > Are your schemas actually supposed to be different between different
>> > windows, or do they depend only on data?
>> > I see you have a commented-out Window.into(new GlobalWindows()) for your
>> > side input - did that work when it wasn't commented out?
>> >
>> > On Fri, Sep 8, 2017 at 2:17 AM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> my code is:
>> >>
>> >> //read docs from mongo
>> >> final PCollection docs = pipeline
>> >> .apply(table.getTableName(),
>> MongoDbIO.read()
>> >> .withUri("mongodb://" +
>> >> connectionParams)
>> >> .withFilter(filter)
>> >> .withDatabase(options.getDBName())
>> >> .withCollection(table.
>> getTableName()))
>> >> .apply("AddEventTimestamps",
>> >> WithTimestamps.of((Document doc) -> new
>> >> Instant(MongodbManagment.docTimeToLong(doc
>> >> .apply("Window Daily",
>> >> Window.into(CalendarWindows.days(1)));
>> >>
>> >> //update bq schema based on window
>> >> final PCollectionView<Map<String, String>>
>> >> tableSchemas = docs
>> >> //.apply("Global Window",Window.into(new
>> >> GlobalWindows()))
>> >> .apply("extract schema " +
>> >> table.getTableName(), new
>> >> LoadMongodbSchemaPipeline.DocsToSchemaTransform(table))
>> >> .apply("getTableSchemaMemory " +
>> >> table.getTableName(),
>> >> ParDo.of(getTableSchemaMemory(table.getTableName(
>> >> .apply(View.asMap());
>> >>
>> >> final PCollection docsRows = docs
>> >> .apply("doc to row " +
>> >> table.getTableName(), ParDo.of(docToTableRow(table.getBqTableName(),
>> >> tableSchemas))
>> >> .withSideInputs(tableSchemas));
>> >>
>> >> final WriteResult apply = docsRows
>> >> .apply("insert data table - " +
>> >> table.getTableName(),
>> >> BigQueryIO.writeTableRows()
>> >>
>> >> .to(TableRefPartition.perDay(options.getBQProject(),
>> >> options.getDatasetId(), table.getBqTableName()))
>> >>
>> >> .withSchemaFromView(tableSchemas)
>> >>
>> >> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_
>> NEEDED)
>> >>
>> >> .withWriteDisposition(WRITE_APPEND));
>> >>
>> >>
>> >> exception is:
>> >>
>&g

Re: BigQueryIO withSchemaFromView

2017-09-09 Thread Chaim Turkel
The schema depends on the data per window.
When I added the global window it works, but then I lose the
performance, since the second stage of writing will begin only after
the side input has read all the data and updated the schema.
The batch mode of BigQueryIO seems to use a global window, and I
don't know why.

chaim
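
Since withSchemaFromView needs its schema side input in the global window (as explained in the quoted reply below), here is a minimal sketch of the workaround being described, i.e. uncommenting the global-window step in the side-input chain; the transform names are the ones from the pipeline code quoted further down this thread, and Document is org.bson.Document:

// Sketch only: rewindow the schema side input into the global window so
// BigQueryIO.withSchemaFromView can consume it.
final PCollectionView<Map<String, String>> tableSchemas = docs
        .apply("Global Window", Window.<Document>into(new GlobalWindows()))
        .apply("extract schema " + table.getTableName(),
                new LoadMongodbSchemaPipeline.DocsToSchemaTransform(table))
        .apply("getTableSchemaMemory " + table.getTableName(),
                ParDo.of(getTableSchemaMemory(table.getTableName())))
        .apply(View.asMap());

As noted above, in batch mode this means the write stage only starts once the schema side input has been fully computed.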

On Fri, Sep 8, 2017 at 9:26 PM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Are your schemas actually supposed to be different between different
> windows, or do they depend only on data?
> I see you have a commented-out Window.into(new GlobalWindows()) for your
> side input - did that work when it wasn't commented out?
>
> On Fri, Sep 8, 2017 at 2:17 AM Chaim Turkel <ch...@behalf.com> wrote:
>
>> my code is:
>>
>> //read docs from mongo
>> final PCollection docs = pipeline
>> .apply(table.getTableName(), MongoDbIO.read()
>> .withUri("mongodb://" +
>> connectionParams)
>> .withFilter(filter)
>> .withDatabase(options.getDBName())
>> .withCollection(table.getTableName()))
>> .apply("AddEventTimestamps",
>> WithTimestamps.of((Document doc) -> new
>> Instant(MongodbManagment.docTimeToLong(doc
>> .apply("Window Daily",
>> Window.into(CalendarWindows.days(1)));
>>
>> //update bq schema based on window
>> final PCollectionView<Map<String, String>>
>> tableSchemas = docs
>> //.apply("Global Window",Window.into(new
>> GlobalWindows()))
>> .apply("extract schema " +
>> table.getTableName(), new
>> LoadMongodbSchemaPipeline.DocsToSchemaTransform(table))
>> .apply("getTableSchemaMemory " +
>> table.getTableName(),
>> ParDo.of(getTableSchemaMemory(table.getTableName(
>> .apply(View.asMap());
>>
>> final PCollection docsRows = docs
>> .apply("doc to row " +
>> table.getTableName(), ParDo.of(docToTableRow(table.getBqTableName(),
>> tableSchemas))
>> .withSideInputs(tableSchemas));
>>
>> final WriteResult apply = docsRows
>> .apply("insert data table - " +
>> table.getTableName(),
>> BigQueryIO.writeTableRows()
>>
>> .to(TableRefPartition.perDay(options.getBQProject(),
>> options.getDatasetId(), table.getBqTableName()))
>>
>> .withSchemaFromView(tableSchemas)
>>
>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
>>
>> .withWriteDisposition(WRITE_APPEND));
>>
>>
>> exception is:
>>
>> Sep 08, 2017 12:16:55 PM
>> org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter 
>> INFO: Opening TableRowWriter to
>>
>> gs://bq-migration/tempMongo/BigQueryWriteTemp/7ae420d6f9eb41e488d2015d2e4d200d/cb3f0aef-9aeb-47ac-93dc-d9a12e4fdcfb.
>> Exception in thread "main"
>> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
>> java.lang.IllegalArgumentException: Attempted to get side input window
>> for GlobalWindow from non-global WindowFn
>> at
>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:331)
>> at
>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:301)
>> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
>> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
>> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
>> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
>> at
>> com.behalf.migration.dataflow.mongodb.LoadMongodbDataPipeline.runPipeline(LoadMongodbDataPipeline.java:347)
>> at
>> com.behalf.migration.dataflow.mongodb.LoadMongodbDataPipeline.main(LoadMongodbDataPipeline.java:372)
>> Caused by: java.lang.IllegalArgumentException: Attempted to get side
>> input window for GlobalWindow from non-global WindowFn
>> at
>> org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn$1.getSideInputWindow(PartitioningWindowFn.java:49)
>> at
>> org.apache.beam.runners.direct.repackaged.runners.core.SimplePushbackSideInputDoFnRunner.

Re: BigQueryIO withSchemaFromView

2017-09-09 Thread Chaim Turkel
the schema depends on the data per window.
when i added the global window it works, but then i loose the
performance, since the secound stage of writing will begin only after
the side input has read all the data and updated the schema
The batchmode of the BigqueryIO seems to use a global window that i
don't know why?

chaim

On Fri, Sep 8, 2017 at 9:26 PM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Are your schemas actually supposed to be different between different
> windows, or do they depend only on data?
> I see you have a commented-out Window.into(new GlobalWindows()) for your
> side input - did that work when it wasn't commented out?
>
> On Fri, Sep 8, 2017 at 2:17 AM Chaim Turkel <ch...@behalf.com> wrote:
>
>> my code is:
>>
>> //read docs from mongo
>> final PCollection docs = pipeline
>> .apply(table.getTableName(), MongoDbIO.read()
>> .withUri("mongodb://" +
>> connectionParams)
>> .withFilter(filter)
>> .withDatabase(options.getDBName())
>> .withCollection(table.getTableName()))
>> .apply("AddEventTimestamps",
>> WithTimestamps.of((Document doc) -> new
>> Instant(MongodbManagment.docTimeToLong(doc
>> .apply("Window Daily",
>> Window.into(CalendarWindows.days(1)));
>>
>> //update bq schema based on window
>> final PCollectionView<Map<String, String>>
>> tableSchemas = docs
>> //.apply("Global Window",Window.into(new
>> GlobalWindows()))
>> .apply("extract schema " +
>> table.getTableName(), new
>> LoadMongodbSchemaPipeline.DocsToSchemaTransform(table))
>> .apply("getTableSchemaMemory " +
>> table.getTableName(),
>> ParDo.of(getTableSchemaMemory(table.getTableName())))
>> .apply(View.asMap());
>>
>> final PCollection docsRows = docs
>> .apply("doc to row " +
>> table.getTableName(), ParDo.of(docToTableRow(table.getBqTableName(),
>> tableSchemas))
>> .withSideInputs(tableSchemas));
>>
>> final WriteResult apply = docsRows
>> .apply("insert data table - " +
>> table.getTableName(),
>> BigQueryIO.writeTableRows()
>>
>> .to(TableRefPartition.perDay(options.getBQProject(),
>> options.getDatasetId(), table.getBqTableName()))
>>
>> .withSchemaFromView(tableSchemas)
>>
>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
>>
>> .withWriteDisposition(WRITE_APPEND));
>>
>>
>> exception is:
>>
>> Sep 08, 2017 12:16:55 PM
>> org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter 
>> INFO: Opening TableRowWriter to
>>
>> gs://bq-migration/tempMongo/BigQueryWriteTemp/7ae420d6f9eb41e488d2015d2e4d200d/cb3f0aef-9aeb-47ac-93dc-d9a12e4fdcfb.
>> Exception in thread "main"
>> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
>> java.lang.IllegalArgumentException: Attempted to get side input window
>> for GlobalWindow from non-global WindowFn
>> at
>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:331)
>> at
>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:301)
>> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
>> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
>> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
>> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
>> at
>> com.behalf.migration.dataflow.mongodb.LoadMongodbDataPipeline.runPipeline(LoadMongodbDataPipeline.java:347)
>> at
>> com.behalf.migration.dataflow.mongodb.LoadMongodbDataPipeline.main(LoadMongodbDataPipeline.java:372)
>> Caused by: java.lang.IllegalArgumentException: Attempted to get side
>> input window for GlobalWindow from non-global WindowFn
>> at
>> org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn$1.getSideInputWindow(PartitioningWindowFn.java:49)
>> at
>> org.apache.beam.runners.direct.repackaged.runners.core.SimplePushbackSideInputDoFnRunner.

Re: writing status

2017-09-08 Thread Chaim Turkel
Is there a way around this for now?
How can I get a snapshot version?

chaim

On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Oh I see! Okay, this should be easy to fix. I'll take a look.
>
> On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> WriteResult does not support apply -> that is the problem
>>
>> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
>> <kirpic...@google.com.invalid> wrote:
>> > Hi,
>> >
>> > Sorry for the delay. So sounds like you want to do something after
>> writing
>> > a window of data to BigQuery is complete.
>> > I think this should be possible: expansion of BigQueryIO.write() returns
>> a
>> > WriteResult and you can apply other transforms to it. Have you tried
>> that?
>> >
>> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> I have documents from a mongo db that i need to migrate to bigquery.
>> >> Since it is mongodb i do not know they schema ahead of time, so i have
>> >> two pipelines, one to run over the documents and update the bigquery
>> >> schema, then wait a few minutes (i can take for bigquery to be able to
>> >> use the new schema) then with the other pipline copy all the
>> >> documents.
>> >> To know as to where i got with the different piplines i have a status
>> >> table so that at the start i know from where to continue.
>> >> So i need the option to update the status table with the success of
>> >> the copy and some time value of the last copied document
>> >>
>> >>
>> >> chaim
>> >>
>> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> >> <kirpic...@google.com.invalid> wrote:
>> >> > I'd like to know more about your both use cases, can you clarify? I
>> think
>> >> > making sinks output something that can be waited on by another
>> pipeline
>> >> > step is a reasonable request, but more details would help refine this
>> >> > suggestion.
>> >> >
>> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
>> chamik...@apache.org>
>> >> > wrote:
>> >> >
>> >> >> Can you do this from the program that runs the Beam job, after job is
>> >> >> complete (you might have to use a blocking runner or poll for the
>> >> status of
>> >> >> the job) ?
>> >> >>
>> >> >> - Cham
>> >> >>
>> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sniem...@apache.org>
>> >> wrote:
>> >> >>
>> >> >> > I also have a similar use case (but with BigTable) that I feel
>> like I
>> >> had
>> >> >> > to hack up to make work.  It'd be great to hear if there is a way
>> to
>> >> do
>> >> >> > something like this already, or if there are plans in the future.
>> >> >> >
>> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> > > Hi,
>> >> >> > >   I have a few piplines that are an ETL from different systems to
>> >> >> > bigquery.
>> >> >> > > I would like to write the status of the ETL after all records
>> have
>> >> >> > > been updated to the bigquery.
>> >> >> > > The problem is that writing to bigquery is a sink and you cannot
>> >> have
>> >> >> > > any other steps after the sink.
>> >> >> > > I tried a sideoutput, but this is called in no correlation to the
>> >> >> > > writing to bigquery, so i don't know if it succeeded or failed.
>> >> >> > >
>> >> >> > >
>> >> >> > > any ideas?
>> >> >> > > chaim
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>


Re: BigQueryIO withSchemaFromView

2017-09-08 Thread Chaim Turkel
my code is:

//read docs from mongo
final PCollection<Document> docs = pipeline
        .apply(table.getTableName(), MongoDbIO.read()
                .withUri("mongodb://" + connectionParams)
                .withFilter(filter)
                .withDatabase(options.getDBName())
                .withCollection(table.getTableName()))
        .apply("AddEventTimestamps", WithTimestamps.of(
                (Document doc) -> new Instant(MongodbManagment.docTimeToLong(doc))))
        .apply("Window Daily", Window.into(CalendarWindows.days(1)));

//update bq schema based on window
final PCollectionView<Map<String, String>> tableSchemas = docs
        //.apply("Global Window", Window.into(new GlobalWindows()))
        .apply("extract schema " + table.getTableName(),
                new LoadMongodbSchemaPipeline.DocsToSchemaTransform(table))
        .apply("getTableSchemaMemory " + table.getTableName(),
                ParDo.of(getTableSchemaMemory(table.getTableName())))
        .apply(View.asMap());

final PCollection<TableRow> docsRows = docs
        .apply("doc to row " + table.getTableName(),
                ParDo.of(docToTableRow(table.getBqTableName(), tableSchemas))
                        .withSideInputs(tableSchemas));

final WriteResult apply = docsRows
        .apply("insert data table - " + table.getTableName(),
                BigQueryIO.writeTableRows()
                        .to(TableRefPartition.perDay(options.getBQProject(),
                                options.getDatasetId(), table.getBqTableName()))
                        .withSchemaFromView(tableSchemas)
                        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                        .withWriteDisposition(WRITE_APPEND));


exception is:

Sep 08, 2017 12:16:55 PM
org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter 
INFO: Opening TableRowWriter to
gs://bq-migration/tempMongo/BigQueryWriteTemp/7ae420d6f9eb41e488d2015d2e4d200d/cb3f0aef-9aeb-47ac-93dc-d9a12e4fdcfb.
Exception in thread "main"
org.apache.beam.sdk.Pipeline$PipelineExecutionException:
java.lang.IllegalArgumentException: Attempted to get side input window
for GlobalWindow from non-global WindowFn
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:331)
at 
org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:301)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
at 
com.behalf.migration.dataflow.mongodb.LoadMongodbDataPipeline.runPipeline(LoadMongodbDataPipeline.java:347)
at 
com.behalf.migration.dataflow.mongodb.LoadMongodbDataPipeline.main(LoadMongodbDataPipeline.java:372)
Caused by: java.lang.IllegalArgumentException: Attempted to get side
input window for GlobalWindow from non-global WindowFn
at 
org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn$1.getSideInputWindow(PartitioningWindowFn.java:49)
at 
org.apache.beam.runners.direct.repackaged.runners.core.SimplePushbackSideInputDoFnRunner.isReady(SimplePushbackSideInputDoFnRunner.java:94)
at 
org.apache.beam.runners.direct.repackaged.runners.core.SimplePushbackSideInputDoFnRunner.processElementInReadyWindows(SimplePushbackSideInputDoFnRunner.java:76)
Sep 08, 2017 12:16:58 PM
org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter 

On Thu, Sep 7, 2017 at 7:07 PM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Please include the full exception and please show the code that produces it.
> See also
> https://beam.apache.org/documentation/programming-guide/#transforms-sideio
> section
> "Side inputs and windowing" - that might be sufficient to resolve your
> problem.
>
> On Thu, Sep 7, 2017 at 5:10 AM Chaim Turkel <ch...@behalf.com> wrote:
>
>> Hi,
>>   I have a pipline that bases on documents from mongo updates the
>> schema and then adds the records to mongo. Since i want a partitioned
>> table, i have a dally window.
>> How do i get the schema view to be a window, i get the exception of:
>>
>> Attempted to get side input window for GlobalWindow from non-global
>> WindowFn"
>>
>> chaim
>>


BigQueryIO withSchemaFromView

2017-09-07 Thread Chaim Turkel
Hi,
  I have a pipeline that, based on the documents from mongo, updates the
schema and then adds the records to BigQuery. Since I want a partitioned
table, I have a daily window.
How do I get the schema view to be per window? I get this exception:

Attempted to get side input window for GlobalWindow from non-global WindowFn

chaim


Re: writing status

2017-09-05 Thread Chaim Turkel
Also, I think getFailedInserts does not work: I expected the write to
succeed and getFailedInserts to return the records that did not insert,
but instead the whole batch failed.

chaim
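
As far as I know, getFailedInserts is only populated for the streaming-insert
path; with batch load jobs a failure surfaces as the whole load job failing,
which would match what is described above. A rough sketch, assuming a Beam
release where Method.STREAMING_INSERTS and InsertRetryPolicy are available
(rows, tableSpec and LogFailedRowFn are placeholders, not from this thread):

    // Rough sketch: route per-row streaming-insert failures to a follow-up step.
    WriteResult result = rows.apply("write to bq",
            BigQueryIO.writeTableRows()
                    .to(tableSpec)
                    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    // Failed rows come back as TableRows that can be logged or written elsewhere.
    result.getFailedInserts().apply("handle failed rows", ParDo.of(new LogFailedRowFn()));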

On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Oh I see! Okay, this should be easy to fix. I'll take a look.
>
> On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> WriteResult does not support apply -> that is the problem
>>
>> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
>> <kirpic...@google.com.invalid> wrote:
>> > Hi,
>> >
>> > Sorry for the delay. So sounds like you want to do something after
>> writing
>> > a window of data to BigQuery is complete.
>> > I think this should be possible: expansion of BigQueryIO.write() returns
>> a
>> > WriteResult and you can apply other transforms to it. Have you tried
>> that?
>> >
>> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> I have documents from a mongo db that i need to migrate to bigquery.
>> >> Since it is mongodb i do not know they schema ahead of time, so i have
>> >> two pipelines, one to run over the documents and update the bigquery
>> >> schema, then wait a few minutes (i can take for bigquery to be able to
>> >> use the new schema) then with the other pipline copy all the
>> >> documents.
>> >> To know as to where i got with the different piplines i have a status
>> >> table so that at the start i know from where to continue.
>> >> So i need the option to update the status table with the success of
>> >> the copy and some time value of the last copied document
>> >>
>> >>
>> >> chaim
>> >>
>> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> >> <kirpic...@google.com.invalid> wrote:
>> >> > I'd like to know more about your both use cases, can you clarify? I
>> think
>> >> > making sinks output something that can be waited on by another
>> pipeline
>> >> > step is a reasonable request, but more details would help refine this
>> >> > suggestion.
>> >> >
>> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
>> chamik...@apache.org>
>> >> > wrote:
>> >> >
>> >> >> Can you do this from the program that runs the Beam job, after job is
>> >> >> complete (you might have to use a blocking runner or poll for the
>> >> status of
>> >> >> the job) ?
>> >> >>
>> >> >> - Cham
>> >> >>
>> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sniem...@apache.org>
>> >> wrote:
>> >> >>
>> >> >> > I also have a similar use case (but with BigTable) that I feel
>> like I
>> >> had
>> >> >> > to hack up to make work.  It'd be great to hear if there is a way
>> to
>> >> do
>> >> >> > something like this already, or if there are plans in the future.
>> >> >> >
>> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> > > Hi,
>> >> >> > >   I have a few piplines that are an ETL from different systems to
>> >> >> > bigquery.
>> >> >> > > I would like to write the status of the ETL after all records
>> have
>> >> >> > > been updated to the bigquery.
>> >> >> > > The problem is that writing to bigquery is a sink and you cannot
>> >> have
>> >> >> > > any other steps after the sink.
>> >> >> > > I tried a sideoutput, but this is called in no correlation to the
>> >> >> > > writing to bigquery, so i don't know if it succeeded or failed.
>> >> >> > >
>> >> >> > >
>> >> >> > > any ideas?
>> >> >> > > chaim
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>


Re: writing status

2017-09-05 Thread Chaim Turkel
thanks, please keep me updated

On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Oh I see! Okay, this should be easy to fix. I'll take a look.
>
> On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> WriteResult does not support apply -> that is the problem
>>
>> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
>> <kirpic...@google.com.invalid> wrote:
>> > Hi,
>> >
>> > Sorry for the delay. So sounds like you want to do something after
>> writing
>> > a window of data to BigQuery is complete.
>> > I think this should be possible: expansion of BigQueryIO.write() returns
>> a
>> > WriteResult and you can apply other transforms to it. Have you tried
>> that?
>> >
>> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> I have documents from a mongo db that i need to migrate to bigquery.
>> >> Since it is mongodb i do not know they schema ahead of time, so i have
>> >> two pipelines, one to run over the documents and update the bigquery
>> >> schema, then wait a few minutes (i can take for bigquery to be able to
>> >> use the new schema) then with the other pipline copy all the
>> >> documents.
>> >> To know as to where i got with the different piplines i have a status
>> >> table so that at the start i know from where to continue.
>> >> So i need the option to update the status table with the success of
>> >> the copy and some time value of the last copied document
>> >>
>> >>
>> >> chaim
>> >>
>> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> >> <kirpic...@google.com.invalid> wrote:
>> >> > I'd like to know more about your both use cases, can you clarify? I
>> think
>> >> > making sinks output something that can be waited on by another
>> pipeline
>> >> > step is a reasonable request, but more details would help refine this
>> >> > suggestion.
>> >> >
>> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
>> chamik...@apache.org>
>> >> > wrote:
>> >> >
>> >> >> Can you do this from the program that runs the Beam job, after job is
>> >> >> complete (you might have to use a blocking runner or poll for the
>> >> status of
>> >> >> the job) ?
>> >> >>
>> >> >> - Cham
>> >> >>
>> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sniem...@apache.org>
>> >> wrote:
>> >> >>
>> >> >> > I also have a similar use case (but with BigTable) that I feel
>> like I
>> >> had
>> >> >> > to hack up to make work.  It'd be great to hear if there is a way
>> to
>> >> do
>> >> >> > something like this already, or if there are plans in the future.
>> >> >> >
>> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> > > Hi,
>> >> >> > >   I have a few piplines that are an ETL from different systems to
>> >> >> > bigquery.
>> >> >> > > I would like to write the status of the ETL after all records
>> have
>> >> >> > > been updated to the bigquery.
>> >> >> > > The problem is that writing to bigquery is a sink and you cannot
>> >> have
>> >> >> > > any other steps after the sink.
>> >> >> > > I tried a sideoutput, but this is called in no correlation to the
>> >> >> > > writing to bigquery, so i don't know if it succeeded or failed.
>> >> >> > >
>> >> >> > >
>> >> >> > > any ideas?
>> >> >> > > chaim
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>


Re: writing status

2017-09-04 Thread Chaim Turkel
WriteResult does not support apply -> that is the problem

On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> Hi,
>
> Sorry for the delay. So sounds like you want to do something after writing
> a window of data to BigQuery is complete.
> I think this should be possible: expansion of BigQueryIO.write() returns a
> WriteResult and you can apply other transforms to it. Have you tried that?
>
> On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> I have documents from a mongo db that i need to migrate to bigquery.
>> Since it is mongodb i do not know they schema ahead of time, so i have
>> two pipelines, one to run over the documents and update the bigquery
>> schema, then wait a few minutes (i can take for bigquery to be able to
>> use the new schema) then with the other pipline copy all the
>> documents.
>> To know as to where i got with the different piplines i have a status
>> table so that at the start i know from where to continue.
>> So i need the option to update the status table with the success of
>> the copy and some time value of the last copied document
>>
>>
>> chaim
>>
>> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> <kirpic...@google.com.invalid> wrote:
>> > I'd like to know more about your both use cases, can you clarify? I think
>> > making sinks output something that can be waited on by another pipeline
>> > step is a reasonable request, but more details would help refine this
>> > suggestion.
>> >
>> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <chamik...@apache.org>
>> > wrote:
>> >
>> >> Can you do this from the program that runs the Beam job, after job is
>> >> complete (you might have to use a blocking runner or poll for the
>> status of
>> >> the job) ?
>> >>
>> >> - Cham
>> >>
>> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sniem...@apache.org>
>> wrote:
>> >>
>> >> > I also have a similar use case (but with BigTable) that I feel like I
>> had
>> >> > to hack up to make work.  It'd be great to hear if there is a way to
>> do
>> >> > something like this already, or if there are plans in the future.
>> >> >
>> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> > > Hi,
>> >> > >   I have a few piplines that are an ETL from different systems to
>> >> > bigquery.
>> >> > > I would like to write the status of the ETL after all records have
>> >> > > been updated to the bigquery.
>> >> > > The problem is that writing to bigquery is a sink and you cannot
>> have
>> >> > > any other steps after the sink.
>> >> > > I tried a sideoutput, but this is called in no correlation to the
>> >> > > writing to bigquery, so i don't know if it succeeded or failed.
>> >> > >
>> >> > >
>> >> > > any ideas?
>> >> > > chaim
>> >> > >
>> >> >
>> >>
>>


Re: writing status

2017-08-26 Thread Chaim Turkel
I have documents from a mongo db that I need to migrate to BigQuery.
Since it is mongodb I do not know their schema ahead of time, so I have
two pipelines: one runs over the documents and updates the BigQuery
schema, then waits a few minutes (it can take a while for BigQuery to be
able to use the new schema), and then the other pipeline copies all the
documents.
To know how far I got with the different pipelines I have a status
table, so that at the start I know from where to continue.
So I need the option to update the status table with the success of
the copy and some time value of the last copied document.


chaim

On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
> I'd like to know more about your both use cases, can you clarify? I think
> making sinks output something that can be waited on by another pipeline
> step is a reasonable request, but more details would help refine this
> suggestion.
>
> On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <chamik...@apache.org>
> wrote:
>
>> Can you do this from the program that runs the Beam job, after job is
>> complete (you might have to use a blocking runner or poll for the status of
>> the job) ?
>>
>> - Cham
>>
>> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sniem...@apache.org> wrote:
>>
>> > I also have a similar use case (but with BigTable) that I feel like I had
>> > to hack up to make work.  It'd be great to hear if there is a way to do
>> > something like this already, or if there are plans in the future.
>> >
>> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> > > Hi,
>> > >   I have a few piplines that are an ETL from different systems to
>> > bigquery.
>> > > I would like to write the status of the ETL after all records have
>> > > been updated to the bigquery.
>> > > The problem is that writing to bigquery is a sink and you cannot have
>> > > any other steps after the sink.
>> > > I tried a sideoutput, but this is called in no correlation to the
>> > > writing to bigquery, so i don't know if it succeeded or failed.
>> > >
>> > >
>> > > any ideas?
>> > > chaim
>> > >
>> >
>>


writing status

2017-08-25 Thread Chaim Turkel
Hi,
  I have a few pipelines that are an ETL from different systems to BigQuery.
I would like to write the status of the ETL after all records have
been updated in BigQuery.
The problem is that writing to BigQuery is a sink, and you cannot have
any other steps after the sink.
I tried a side output, but it is called with no correlation to the
writing to BigQuery, so I don't know if it succeeded or failed.


any ideas?
chaim
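
A minimal sketch of the workaround suggested in this thread: block on the
pipeline from the launching program and update the status table once it has
finished (updateStatusTable and lastCopiedTimestamp are placeholders for
whatever client and bookkeeping you already use).

    // Run the job, wait for it to finish, then record the status.
    PipelineResult result = pipeline.run();
    PipelineResult.State state = result.waitUntilFinish();
    if (state == PipelineResult.State.DONE) {
        updateStatusTable(table.getTableName(), lastCopiedTimestamp);
    }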


Re: Dynamically generate BigQuery schema based on input?

2017-08-24 Thread Chaim Turkel
It is not clear whether you need a dynamic destination or a dynamic
schema. For the destination, to() will work, but for the schema you need
to use withSchemaFromView.
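
A minimal sketch of combining the two, along the lines of the code elsewhere
in this archive: a function passed to to() picks the destination table per
element, while withSchemaFromView supplies the schema from a side-input map of
table spec to JSON-serialized TableSchema (schemas, rows and tableSpecFor are
placeholders).

    // Sketch only: dynamic destination via to(), dynamic schema via withSchemaFromView().
    PCollectionView<Map<String, String>> schemaView = schemas.apply(View.asMap());

    SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> toDestination =
            input -> new TableDestination(tableSpecFor(input.getValue()), null);

    rows.apply(BigQueryIO.writeTableRows()
            .to(toDestination)
            .withSchemaFromView(schemaView)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));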

On Thu, Aug 24, 2017 at 3:39 AM, Eugene Kirpichov
 wrote:
> Yes, this is possible using the BigQueryIO.write().to(DynamicDestinations)
> API. It allows you to write different values to different tables with
> different schemas.
> On Wed, Aug 23, 2017, 2:51 PM Asha Rostamianfar 
> wrote:
>
>> Hi,
>>
>> I'm wondering whether it's possible to dynamically generate a BigQuery
>> schema based on input. For instance, the fields would be specified in one
>> or more input files that are read and processed as part of the pipeline.
>>
>> Thanks,
>> Asha
>>