Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE
Hi Mihail, Thanks for the code. I'm trying to reproduce the problem now.

On Wed, Jul 1, 2015 at 8:30 PM, Mihail Vieru vi...@informatik.hu-berlin.de wrote: Hi Max, thank you for your reply. I wanted to revise and dismiss all other factors before writing back. I've attached my code and sample input data. I run the *APSPNaiveJob* using the following arguments: *0 100 hdfs://path/to/vertices-test-100 hdfs://path/to/edges-test-100 hdfs://path/to/tempgraph 10 0.5 hdfs://path/to/output-apsp 9* I was wrong: I originally thought that the first writeAsCsv call (line 50) doesn't work. An exception is thrown without WriteMode.OVERWRITE when the file exists, but the problem actually lies with the second call (line 74), which tries to write to the same path on HDFS. This issue is blocking me, because I need to persist the vertices dataset between iterations. Cheers, Mihail P.S.: I'm using the latest 0.10-SNAPSHOT and HDFS 1.2.1.

On 30.06.2015 16:51, Maximilian Michels wrote: Hi Mihail, Thank you for your question. Do you have a short example that reproduces the problem? It is hard to find the cause without an error message or some example code. I wonder how your loop works without WriteMode.OVERWRITE, because it should throw an exception in that case. Or do you change the file names on every write? Cheers, Max

On Tue, Jun 30, 2015 at 3:47 PM, Mihail Vieru vi...@informatik.hu-berlin.de wrote: I think my problem is related to a loop in my job. Before the loop, the writeAsCsv method works fine, even in overwrite mode. In the loop, in the first iteration, it writes an empty folder containing empty files to HDFS, even though the DataSet it is supposed to write contains elements. Needless to say, this doesn't occur in a local execution environment, when writing to the local file system. I would appreciate any input on this. Best, Mihail

On 30.06.2015 12:10, Mihail Vieru wrote: Hi Till, thank you for your reply. I have the following code snippet: *intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n", ";", WriteMode.OVERWRITE);* When I remove the WriteMode parameter, it works. So I can reason that the DataSet contains data elements. Cheers, Mihail

On 30.06.2015 12:06, Till Rohrmann wrote: Hi Mihail, have you checked that the DataSet you want to write to HDFS actually contains data elements? You can try calling collect, which retrieves the data to your client, to see what's in there. Cheers, Till

On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru vi...@informatik.hu-berlin.de wrote: Hi, the writeAsCsv method is not writing anything to HDFS (version 1.2.1) when the WriteMode is set to OVERWRITE. A file is created, but it's empty, and there is no trace of errors in the Flink or Hadoop logs on any node in the cluster. What could cause this issue? I really need this feature. Best, Mihail
RE: How to cancel a Flink DataSource from the driver code?
Hi Stephan, I think that clean shutdown is a major feature for building a complex persistent service that uses Flink Streaming for a data-quality-critical task, so I'll mark my code with a // FIXME comment until this feature is available! Greetings, Arnaud

From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] on behalf of Stephan Ewen Sent: Wednesday, July 1, 2015, 15:58 To: user@flink.apache.org Subject: Re: How to cancel a Flink DataSource from the driver code?

Hi Arnaud! There is a pending issue and pull request that adds a cancel() call to the command line interface: https://github.com/apache/flink/pull/750 It would be possible to extend that so that the driver can also cancel the program. Greetings, Stephan

On Wed, Jul 1, 2015 at 3:33 PM, LINZ, Arnaud al...@bouyguestelecom.fr wrote: Hello, I really looked in the documentation but unfortunately could not find the answer: how do you cancel your SourceFunction from your "driver" code (i.e., from a monitoring thread that can initiate a proper shutdown)? Calling cancel() on the object passed to addSource() has no effect, since it does not apply to the marshalled distributed object(s). Best regards, Arnaud
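For reference, the cancellation hook inside a source usually looks like the flag-checking loop below. This is a plain-Java sketch (the class and methods are hypothetical, not Flink API); it also illustrates Arnaud's observation: the framework invokes cancel() on the distributed copies of the source, so calling it on the driver-side instance never reaches the flag the remote loop is checking.

```java
// Plain-Java sketch of the common cancellable-source pattern (hypothetical
// class, not Flink API): the emit loop checks a volatile flag that cancel()
// flips. In a cluster, only the copy running on the TaskManager has its
// cancel() called by the framework -- not the object held by the driver.
public class StoppableLoop {
    private volatile boolean running = true;
    private int emitted = 0;

    public void cancel() {
        running = false;
    }

    // Stand-in for SourceFunction.run(): emits until cancelled or exhausted.
    public int runUpTo(int max) {
        while (running && emitted < max) {
            emitted++; // stand-in for ctx.collect(record)
        }
        return emitted;
    }
}
```
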
Re: Data Source from Cassandra
Hi! If there is a CassandraSource for Hadoop, you can also use that with the HadoopInputFormatWrapper. If you want to implement a Flink-specific source, extending InputFormat is the right thing to do. A user has started to implement a Cassandra sink in this fork (you may be able to reuse some code or testing infrastructure): https://github.com/rzvoncek/flink/tree/zvo/cassandraSink Greetings, Stephan

On Thu, Jul 2, 2015 at 11:34 AM, tambunanw if05...@gmail.com wrote: Hi All, I want to ask if there's a custom data source available for Cassandra. From my observation, it seems that we need to implement one by extending InputFormat. Is there any guide on how to do this robustly? Cheers
Re: Data Source from Cassandra
Hi Stephan, thanks a lot! I will give it a look. Cheers -- Welly Tambunan Triplelands http://weltam.wordpress.com http://www.triplelands.com/blog/
Re: Flink 0.9 built with Scala 2.11
@Alexander I’m happy to hear that you want to help. I’d really appreciate it. :) Regards, Chiwan Park

On Jul 2, 2015, at 2:57 PM, Alexander Alexandrov alexander.s.alexand...@gmail.com wrote: @Chiwan: let me know if you need hands-on support. I'll be more than happy to help (as my downstream project is using Scala 2.11).

2015-07-01 17:43 GMT+02:00 Chiwan Park chiwanp...@apache.org: Okay, I will apply this suggestion. Regards, Chiwan Park

On Jul 1, 2015, at 5:41 PM, Ufuk Celebi u...@apache.org wrote: On 01 Jul 2015, at 10:34, Stephan Ewen se...@apache.org wrote: +1, like that approach +1 I like that this is not breaking for non-Scala users :-)
Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE
The problem is that your input and output path are the same. Because Flink executes in a pipelined fashion, all the operators come up at once. When you set WriteMode.OVERWRITE for the sink, it deletes the path before writing anything. That means that when your DataSource reads the input, there is nothing left to read, so you get an empty DataSet, which you write to HDFS again. Any further loops will then just write nothing.

You can circumvent this problem by prefixing every output file with a counter that you increment in your loop. Alternatively, if you only want to keep the latest output, you can use two files and alternate between them as input and output. Let me know if you have any further questions. Kind regards, Max

On Thu, Jul 2, 2015 at 10:20 AM, Maximilian Michels m...@apache.org wrote: Hi Mihail, Thanks for the code. I'm trying to reproduce the problem now.
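The two-path alternation Max suggests can be sketched with a small helper (the helper and the concrete paths are hypothetical, for illustration only): on even iterations the job reads path A and writes path B, on odd iterations the other way around, so the OVERWRITE sink never deletes the path the DataSource is still reading.

```java
// Hypothetical helper for the alternating-path workaround: each iteration
// reads from one temp path and writes to the other, so the WriteMode.OVERWRITE
// sink never deletes the path the DataSource of the same iteration reads.
public class PathAlternation {
    /** Returns {readPath, writePath} for the given iteration. */
    static String[] pathsFor(int iteration, String pathA, String pathB) {
        return (iteration % 2 == 0)
                ? new String[] {pathA, pathB}
                : new String[] {pathB, pathA};
    }
}
```

Inside the loop, iteration i would then write with writeAsCsv to pathsFor(i, ...)[1] and read the vertices back from pathsFor(i, ...)[0].
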
Re: Batch Processing as Streaming
Thanks Stephan, that's clear! Cheers

On Thu, Jul 2, 2015 at 6:13 PM, Stephan Ewen se...@apache.org wrote: Hi! I am actually working to get some more docs out there; there is a lack right now, agreed. Concerning your questions: (1) Batch programs basically recover from the data sources right now. Checkpointing as in the streaming case does not happen for batch programs. We have branches that materialize the intermediate streams and apply backtracking logic for batch programs, but they are not merged into the master at this point. (2) Streaming operators and user functions are long-lived. They are started once and live to the end of the stream, or until a machine failure. Greetings, Stephan

On Thu, Jul 2, 2015 at 11:48 AM, tambunanw if05...@gmail.com wrote: Hi All, I see that the way batch processing works in Flink is quite different from Spark: it's all about using the streaming engine. I have a couple of questions: 1. Is there any support for checkpointing in batch processing too, or is that only for streaming? 2. I want to ask about the operator lifecycle: is it short-lived or long-lived? Any docs where I can read more about this? Cheers -- Welly Tambunan Triplelands http://weltam.wordpress.com http://www.triplelands.com/blog/
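Stephan's point (2) can be pictured with a plain-Java sketch of a long-lived operator (hypothetical class, not Flink API): setup runs once, the processing method runs for every record for the life of the stream, and teardown runs once at the end.

```java
// Plain-Java sketch of a long-lived streaming operator (hypothetical class):
// open() once at startup, process() per record for the life of the stream,
// close() once at the end -- the same operator instance survives all records.
public class OperatorLifecycle {
    private final StringBuilder trace = new StringBuilder();

    void open() { trace.append("open;"); }
    void process(int record) { trace.append("r").append(record).append(';'); }
    void close() { trace.append("close;"); }

    public String runOver(int[] records) {
        open();
        for (int r : records) {
            process(r);
        }
        close();
        return trace.toString();
    }
}
```
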
POJO coGroup on null value
Hi to all, I'd like to join 2 datasets of POJOs, for example:

Person: - name - birthPlaceId
Place: - id - name

I'd like to do people.coGroup(places).where("birthPlaceId").equalTo("id").with(...). However, not all people have a birthPlaceId value in my use case, so I get a NullPointerException. Am I using the wrong operator for this? This is the stack trace:

java.lang.RuntimeException: A NullPointerException occured while accessing a key field in a POJO. Most likely, the value grouped/joined on is null. Field name: birthPlaceId
at org.apache.flink.api.java.typeutils.runtime.PojoComparator.hash(PojoComparator.java:217)
at org.apache.flink.runtime.operators.shipping.OutputEmitter.hashPartitionDefault(OutputEmitter.java:175)
at org.apache.flink.runtime.operators.shipping.OutputEmitter.selectChannels(OutputEmitter.java:132)
at org.apache.flink.runtime.operators.shipping.OutputEmitter.selectChannels(OutputEmitter.java:28)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:78)
at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)

Best, Flavio
Re: POJO coGroup on null value
Hi Flavio! Keys cannot be null in Flink; that is a contract deep in the system. Filter out the null-valued elements or, if you want them in the result, try using a special value for null. That should do it. BTW: in SQL, joining on null usually filters out elements, as key operations on null are undefined. Greetings, Stephan
Re: POJO coGroup on null value
ok, thanks for the help Stephan!
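Stephan's two suggestions can be sketched on plain collections (the Person type and the sentinel value here are hypothetical, for illustration only): either drop the records whose key field is null before the coGroup, or substitute a sentinel so those records survive the key-based operation.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-collections sketch of the two workarounds for null keys (hypothetical
// Person type and sentinel): filter null-keyed records out, or map null to a
// special value so the record stays in the coGroup result.
public class NullKeys {
    static final String NO_PLACE = "__UNKNOWN_PLACE__"; // hypothetical sentinel

    static class Person {
        final String name;
        final String birthPlaceId;
        Person(String name, String birthPlaceId) {
            this.name = name;
            this.birthPlaceId = birthPlaceId;
        }
    }

    // Option 1: mirrors people.filter(p -> p.birthPlaceId != null) on a DataSet.
    static List<Person> withoutNullKeys(List<Person> people) {
        List<Person> result = new ArrayList<>();
        for (Person p : people) {
            if (p.birthPlaceId != null) {
                result.add(p);
            }
        }
        return result;
    }

    // Option 2: mirrors a map() that replaces the null key with the sentinel.
    static Person withSentinelKey(Person p) {
        return p.birthPlaceId != null ? p : new Person(p.name, NO_PLACE);
    }
}
```

With the sentinel variant, people with an unknown birthplace simply coGroup into the sentinel's (empty) place group instead of crashing the job.
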
TeraSort on Flink and Spark
Hello, I'd like to share my code for TeraSort on Flink and Spark, which uses the same range partitioner as Hadoop TeraSort: https://github.com/eastcirclek/terasort

I also wrote a short report on it: http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html In the blog post, I make a simple performance comparison between Flink, Spark, Tez, and MapReduce. I hope it will be helpful to you guys! Thanks. Dongwon Kim Postdoctoral Researcher @ Postech