Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the streaming
behaviour you wish for, but I could be wrong. Also, are you in PySpark or
Scala Spark?
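For intuition, here is a plain-Python analogue of the streaming behaviour in question (not Spark code, just an illustration): gzip can be decompressed line by line, so counting lines never needs the whole file in memory. That is essentially what the Hadoop text input path does for a .gz file too, with the caveat that gzip isn't splittable, so one file maps to one task.

```python
import gzip
import os
import tempfile

def count_lines_streaming(path):
    # gzip.open yields decompressed lines lazily; summing over the
    # iterator never materialises the whole file in memory.
    with gzip.open(path, "rt") as f:
        return sum(1 for _ in f)

# Tiny demo file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    for i in range(1000):
        f.write(f"record {i}\n")

print(count_lines_streaming(path))  # 1000
os.remove(path)
```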
We've had success with Azkaban from LinkedIn over Oozie and Luigi.
http://azkaban.github.io/
Azkaban has support for many different job types, a fleshed-out web UI with
decent log reporting, a solid failure/retry model, a REST API, and I
think support for multiple executor slaves is coming in
I want to avoid the small-files problem when using Spark without having to
manually calibrate a `repartition` at the end of each Spark application I
write, since the amount of data passing through sadly isn't all that
predictable. We're reading data from and writing data to HDFS.
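One common heuristic (my own sketch, not something from this thread) is to derive the partition count from the input byte size and a target partition size instead of hard-coding it. The names `target_partitions` and `TARGET_BYTES` below are illustrative, not Spark API:

```python
import math

TARGET_BYTES = 128 * 1024 * 1024  # ~ one HDFS block per output file

def target_partitions(total_input_bytes, target=TARGET_BYTES):
    """Partition count so each output file is roughly one HDFS block."""
    return max(1, math.ceil(total_input_bytes / target))

# e.g. a 10 GB input:
print(target_partitions(10 * 1024**3))  # 80
```

In Spark you would then feed this into the final `coalesce`/`repartition` call, computing `total_input_bytes` from the source (e.g. via the HDFS FileSystem API), so the calibration tracks the data volume automatically.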
I know other
Hello fellow Sparkians.
In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ,
Matei suggested that Spark might get deferred grouping and forced execution
of multiple jobs in an efficient way. His code sample:
rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
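To make the idea concrete, here is a plain-Python sketch of the `reduceLater` pattern: submit several reductions up front, get Futures back, and only block when the results are needed. (With real Spark you can approximate this today by calling actions from multiple threads, since the scheduler accepts concurrent jobs; everything below is illustrative, not Spark API.)

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

data = list(range(1, 101))

def reduce_later(pool, dataset, fn):
    """Kick off a reduction and return a Future immediately."""
    return pool.submit(reduce, fn, dataset)

with ThreadPoolExecutor(max_workers=2) as pool:
    total = reduce_later(pool, data, lambda a, b: a + b)  # Future[int]
    biggest = reduce_later(pool, data, max)               # Future[int]
    # Both "jobs" are now in flight; we block only here:
    print(total.result(), biggest.result())  # 5050 100
```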
Brad: did you ever manage to figure this out? We're experiencing similar
problems, and have also found that the memory limitations supplied to the
Java side of PySpark don't limit how much memory Python can consume (which
makes sense).
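One blunt, Unix-only guard we've considered (an assumption on our part, not anything PySpark does for you): since the JVM flags only bound the Java side and the forked Python workers allocate outside that limit, you can cap a Python process's address space with `resource.setrlimit` so a runaway job dies with a MemoryError instead of exhausting the machine:

```python
import resource

def cap_address_space(max_bytes):
    """Cap this Python process's virtual address space (Unix only)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Never raise the soft limit above the hard limit, and lower only
    # the soft limit so the cap can be adjusted again later.
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
    return max_bytes

cap = cap_address_space(8 * 1024**3)  # ~8 GiB ceiling for this process
print(resource.getrlimit(resource.RLIMIT_AS)[0] == cap)
```

A worker that exceeds the cap fails fast in its own process rather than dragging the node down, which at least makes the problem visible.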
Have you profiled the datasets you are trying to join? Is