Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-07-02 Thread Maximilian Michels
Hi Mihail,

Thanks for the code. I'm trying to reproduce the problem now.

On Wed, Jul 1, 2015 at 8:30 PM, Mihail Vieru vi...@informatik.hu-berlin.de
wrote:

  Hi Max,

 thank you for your reply. I wanted to rule out all other factors before
 writing back. I've attached my code and sample input data.

 I run the *APSPNaiveJob* using the following arguments:

 *0 100 hdfs://path/to/vertices-test-100 hdfs://path/to/edges-test-100
 hdfs://path/to/tempgraph 10 0.5 hdfs://path/to/output-apsp 9*

 I was wrong; I originally thought that the first writeAsCsv call (line 50)
 doesn't work. Without WriteMode.OVERWRITE, an exception is thrown when the
 file already exists.

 But the problem lies with the second call (line 74), which tries to write to
 the same path on HDFS.

 This issue is blocking me, because I need to persist the vertices dataset
 between iterations.

 Cheers,
 Mihail

 P.S.: I'm using the latest 0.10-SNAPSHOT and HDFS 1.2.1.



 On 30.06.2015 16:51, Maximilian Michels wrote:

   Hi Mihail,

  Thank you for your question. Do you have a short example that reproduces
 the problem? It is hard to find the cause without an error message or some
 example code.

  I wonder how your loop works without WriteMode.OVERWRITE because it
 should throw an exception in this case. Or do you change the file names on
 every write?

  Cheers,
  Max

 On Tue, Jun 30, 2015 at 3:47 PM, Mihail Vieru 
 vi...@informatik.hu-berlin.de wrote:

  I think my problem is related to a loop in my job.

 Before the loop, the writeAsCsv method works fine, even in overwrite mode.

 In the loop, in the first iteration, it writes an empty folder containing
 empty files to HDFS, even though the DataSet it is supposed to write
 contains elements.

 Needless to say, this doesn't occur in a local execution environment,
 when writing to the local file system.


 I would appreciate any input on this.

 Best,
 Mihail



 On 30.06.2015 12:10, Mihail Vieru wrote:

 Hi Till,

 thank you for your reply.

 I have the following code snippet:

 *intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n",
 ";", WriteMode.OVERWRITE);*

 When I remove the WriteMode parameter, it works. So I can conclude that the
 DataSet contains data elements.

 Cheers,
 Mihail


 On 30.06.2015 12:06, Till Rohrmann wrote:

  Hi Mihail,

 have you checked that the DataSet you want to write to HDFS actually
 contains data elements? You can try calling collect which retrieves the
 data to your client to see what’s in there.
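
 For illustration, a minimal sketch of that check (the Vertex type parameters
 are only a guess at what your graph uses; collect() pulls the DataSet back to
 the client as a java.util.List):

 List<Vertex<Long, Double>> localVertices =
     intermediateGraph.getVertices().collect();
 System.out.println("number of vertices: " + localVertices.size());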

 Cheers,
 Till

 On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru 
 vi...@informatik.hu-berlin.de wrote:

 Hi,

 the writeAsCsv method is not writing anything to HDFS (version 1.2.1)
 when the WriteMode is set to OVERWRITE.
 A file is created, but it's empty, and there is no trace of errors in the
 Flink or Hadoop logs on any node in the cluster.

 What could cause this issue? I really need this feature.

 Best,
 Mihail









RE: How to cancel a Flink DataSource from the driver code?

2015-07-02 Thread LINZ, Arnaud
Hi Stephan,

I think that a clean shutdown is a major feature for building a complex persistent 
service that uses Flink Streaming for a data-quality-critical task, and I’ll 
mark my code with a // FIXME comment while waiting for this feature to become available!

Greetings,
Arnaud



From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On behalf of Stephan 
Ewen
Sent: Wednesday, July 1, 2015 15:58
To: user@flink.apache.org
Subject: Re: How to cancel a Flink DataSource from the driver code?

Hi Arnaud!

There is a pending issue and pull request that is adding a cancel() call to 
the command line interface.

https://github.com/apache/flink/pull/750

It would be possible to extend that such that the driver can also cancel the 
program.

Greetings,
Stephan


On Wed, Jul 1, 2015 at 3:33 PM, LINZ, Arnaud 
al...@bouyguestelecom.fr wrote:
Hello,

I looked through the documentation but unfortunately could not find the 
answer: how do you cancel your data SourceFunction from your “driver” code 
(i.e., from a monitoring thread that can initiate a proper shutdown)? Calling 
“cancel()” on the object passed to addSource() has no effect, since it does 
not apply to the marshalled, distributed object(s).

Best regards,
Arnaud








Re: Data Source from Cassandra

2015-07-02 Thread Stephan Ewen
Hi!

If there is a CassandraSource for Hadoop, you can also use that with the
HadoopInputFormatWrapper.

If you want to implement a Flink-specific source, extending InputFormat is
the right thing to do. A user has started to implement a Cassandra sink in
this fork (you may be able to reuse some of its code or testing infrastructure):
https://github.com/rzvoncek/flink/tree/zvo/cassandraSink
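
For illustration, a minimal sketch of the InputFormat route (not a tested
implementation): a Flink GenericInputFormat that reads rows with the DataStax
Java driver. The contact point, keyspace, table and column names are made up,
and the split handling is naive (every parallel instance would read the whole
table), so a robust version would assign a token range per split.

import java.io.IOException;
import java.util.Iterator;

import org.apache.flink.api.common.io.GenericInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.io.GenericInputSplit;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraPlacesInputFormat extends GenericInputFormat<Tuple2<String, String>> {

  private transient Cluster cluster;
  private transient Session session;
  private transient Iterator<Row> rows;

  @Override
  public void open(GenericInputSplit split) throws IOException {
    super.open(split);
    // One connection per parallel instance; host and keyspace are placeholders.
    cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    session = cluster.connect("my_keyspace");
    rows = session.execute("SELECT id, name FROM places").iterator();
  }

  @Override
  public boolean reachedEnd() throws IOException {
    return !rows.hasNext();
  }

  @Override
  public Tuple2<String, String> nextRecord(Tuple2<String, String> reuse) throws IOException {
    Row row = rows.next();
    reuse.f0 = row.getString("id");
    reuse.f1 = row.getString("name");
    return reuse;
  }

  @Override
  public void close() throws IOException {
    if (session != null) { session.close(); }
    if (cluster != null) { cluster.close(); }
  }
}

You would then create the source with env.createInput(new CassandraPlacesInputFormat()).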

Greetings,
Stephan





On Thu, Jul 2, 2015 at 11:34 AM, tambunanw if05...@gmail.com wrote:

 Hi All,

 I want to know if there's a custom data source available for Cassandra.

 From my observation, it seems that we need to implement that by extending
 InputFormat. Is there any guide on how to do this robustly?


 Cheers






Re: Data Source from Cassandra

2015-07-02 Thread Welly Tambunan
Hi Stephan,

Thanks a lot !

I will give it a look.


Cheers



-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com http://www.triplelands.com/blog/


Re: Flink 0.9 built with Scala 2.11

2015-07-02 Thread Chiwan Park
@Alexander I’m happy to hear that you want to help me. If you do, I will really 
appreciate it. :)

Regards,
Chiwan Park


 On Jul 2, 2015, at 2:57 PM, Alexander Alexandrov 
 alexander.s.alexand...@gmail.com wrote:
 
 @Chiwan: let me know if you need hands-on support. I'll be more than happy to 
 help (as my downstream project is using Scala 2.11).
 
 2015-07-01 17:43 GMT+02:00 Chiwan Park chiwanp...@apache.org:
 Okay, I will apply this suggestion.
 
 Regards,
 Chiwan Park
 
  On Jul 1, 2015, at 5:41 PM, Ufuk Celebi u...@apache.org wrote:
 
 
  On 01 Jul 2015, at 10:34, Stephan Ewen se...@apache.org wrote:
 
  +1, like that approach
 
  +1
 
  I like that this is not breaking for non-Scala users :-)
 
 
 
 



Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-07-02 Thread Maximilian Michels
The problem is that your input and output paths are the same. Because Flink
executes in a pipelined fashion, all the operators come up at once. When you
set WriteMode.OVERWRITE for the sink, it deletes the path before writing
anything. That means that when your DataSource reads the input, there is
nothing left to read. Thus you get an empty DataSet, which you write to HDFS
again. Any further iterations will then just write nothing.

You can circumvent this problem by prefixing every output file with a counter
that you increment in your loop. Alternatively, if you only want to keep the
latest output, you can use two files and let them alternate as input and
output.
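
As an illustration only (the names and helper methods below are made up, not
taken from your attached job), the counter variant looks roughly like this:

String tempGraphBasePath = "hdfs://path/to/tempgraph";
String currentInputPath = initialVerticesPath; // assumed to be defined elsewhere

for (int i = 0; i < maxIterations; i++) {
    // readVertices() and runOneIteration() stand in for your own logic.
    DataSet<Tuple2<Long, Double>> vertices = readVertices(env, currentInputPath);
    DataSet<Tuple2<Long, Double>> updated = runOneIteration(vertices);

    String currentOutputPath = tempGraphBasePath + "-" + i;
    updated.writeAsCsv(currentOutputPath, "\n", ";", WriteMode.OVERWRITE);
    env.execute("APSP iteration " + i);

    // The path written in this iteration becomes the input of the next one,
    // so OVERWRITE never deletes a path that is read within the same job.
    currentInputPath = currentOutputPath;
}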

Let me know if you have any further questions.

Kind regards,
Max


Re: Batch Processing as Streaming

2015-07-02 Thread Welly Tambunan
Thanks Stephan,

That's clear !

Cheers

On Thu, Jul 2, 2015 at 6:13 PM, Stephan Ewen se...@apache.org wrote:

 Hi!

 I am actually working on getting some more docs out there; there is a lack
 right now, agreed.

 Concerning your questions:

 (1) Batch programs basically recover from the data sources right now.
 Checkpointing as in the streaming case does not happen for batch programs.
 We have branches that materialize the intermediate streams and apply
 backtracking logic for batch programs, but they are not merged into the
 master at this point.

 (2) Streaming operators and user functions are long-lived. They are
 started once and live until the end of the stream or a machine failure.

 Greetings,
 Stephan


 On Thu, Jul 2, 2015 at 11:48 AM, tambunanw if05...@gmail.com wrote:

 Hi All,

 I see that the way batch processing works in Flink is quite different from
 Spark. It's all about using the streaming engine in Flink.

 I have a couple of question

 1. Is there any support for checkpointing in batch processing as well, or is
 that only for streaming?

 2. I want to ask about the operator lifecycle: is it short-lived or
 long-lived?
 Are there any docs where I can read more about this?


 Cheers








-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com http://www.triplelands.com/blog/


POJO coCroup on null value

2015-07-02 Thread Flavio Pompermaier
Hi to all,

I'd like to join 2 datasets of POJOs, for example:

Person:
 - name
 - birthPlaceId

Place:
 - id
 - name

I'd like to do
people.coGroup(places).where("birthPlaceId").equalTo("id").with(...)

However, not all people have a birthPlaceId value in my use case, so I get a
NullPointerException.
Am I using the wrong operator for this?
This is the stackTrace:

java.lang.RuntimeException: A NullPointerException occured while accessing
a key field in a POJO. Most likely, the value grouped/joined on is null.
Field name: birthPlaceId
at
org.apache.flink.api.java.typeutils.runtime.PojoComparator.hash(PojoComparator.java:217)
at
org.apache.flink.runtime.operators.shipping.OutputEmitter.hashPartitionDefault(OutputEmitter.java:175)
at
org.apache.flink.runtime.operators.shipping.OutputEmitter.selectChannels(OutputEmitter.java:132)
at
org.apache.flink.runtime.operators.shipping.OutputEmitter.selectChannels(OutputEmitter.java:28)
at
org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:78)
at
org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)

Best,
Flavio


Re: POJO coCroup on null value

2015-07-02 Thread Stephan Ewen
Hi Flavio!

Keys cannot be null in Flink; that is a contract deep in the system.

Filter out the null-valued elements or, if you want them in the result, try
using a special placeholder value for null. That should do it.
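
For example, a minimal sketch of the filtering variant with the Java DataSet
API, assuming Person and Place are POJOs with public fields as in your mail
(it uses FilterFunction, CoGroupFunction, Collector and Tuple2 from the Flink
Java API):

// Drop the people without a birth place so the key extractor never sees null.
DataSet<Person> peopleWithPlace = people.filter(new FilterFunction<Person>() {
    @Override
    public boolean filter(Person p) {
        return p.birthPlaceId != null;
    }
});

peopleWithPlace.coGroup(places)
    .where("birthPlaceId")
    .equalTo("id")
    .with(new CoGroupFunction<Person, Place, Tuple2<String, String>>() {
        @Override
        public void coGroup(Iterable<Person> persons, Iterable<Place> birthPlaces,
                            Collector<Tuple2<String, String>> out) {
            // Emit (person name, place name) pairs; adapt to what you need.
            for (Place place : birthPlaces) {
                for (Person person : persons) {
                    out.collect(new Tuple2<String, String>(person.name, place.name));
                }
            }
        }
    });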

BTW: In SQL, joining on null usually filters out elements, as key
operations on null are undefined.

Greetings,
Stephan





Re: POJO coCroup on null value

2015-07-02 Thread Flavio Pompermaier
ok, thanks for the help Stephan!



TeraSort on Flink and Spark

2015-07-02 Thread Dongwon Kim
Hello,

I'd like to share my code for TeraSort on Flink and Spark which uses
the same range partitioner as Hadoop TeraSort:
https://github.com/eastcirclek/terasort

I also wrote a short report on it:
http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html
In the blog post, I make a simple performance comparison between
Flink, Spark, Tez, and MapReduce.

I hope it will be helpful to you guys!
Thanks.

Dongwon Kim
Postdoctoral Researcher @ Postech