Long running Spark Job Status on Remote Submission

2017-11-20 Thread Harsh Choudhary
Hi, I am submitting a Spark job to a YARN cluster from a remote machine that is not part of the cluster itself. When a job takes a long time, the spark-submit process never exits because it keeps waiting for the job's status, even though the job finishes on the cluster
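
Not part of the original message, but two options commonly used for this: passing --conf spark.yarn.submit.waitAppCompletion=false to spark-submit in yarn-cluster mode makes the client return right after submission, and the SparkLauncher API lets the remote machine track the application programmatically instead of keeping spark-submit alive. A minimal sketch of the latter; the jar path and main class are placeholders:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    object RemoteSubmit {
      def main(args: Array[String]): Unit = {
        // The app resource and main class below are placeholders.
        val handle: SparkAppHandle = new SparkLauncher()
          .setAppResource("/path/to/app.jar")
          .setMainClass("com.example.Main")
          .setMaster("yarn")
          .setDeployMode("cluster")
          .startApplication()

        // Poll the application state from the remote machine instead of
        // tailing spark-submit output.
        while (!handle.getState.isFinal) {
          println(s"Application state: ${handle.getState}")
          Thread.sleep(5000)
        }
        println(s"Final state: ${handle.getState}")
      }
    }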

Re: Writing custom Structured Streaming receiver

2017-11-20 Thread nezhazheng
Hi Hien, You can write your own Source or Sink and register it through SPI (https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html). Below is an example that implements a Kafka 0.8 source: https://github.com/jerryshao/spark-kafka-0-8-sql
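
For reference, a schematic skeleton of what such a source looks like against the Spark 2.2 internal Source/StreamSourceProvider APIs; all class, option, and column names here are illustrative and not taken from the linked repository. The provider is registered for short-name lookup by listing its fully qualified class name in a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file.

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
    import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Registered via META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    class MyStreamProvider extends StreamSourceProvider with DataSourceRegister {
      override def shortName(): String = "my-source"

      override def sourceSchema(
          sqlContext: SQLContext,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): (String, StructType) =
        (shortName(), MyStreamSource.schema)

      override def createSource(
          sqlContext: SQLContext,
          metadataPath: String,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): Source =
        new MyStreamSource(sqlContext)
    }

    object MyStreamSource {
      val schema: StructType = StructType(StructField("value", StringType) :: Nil)
    }

    // Skeleton only: a real source tracks offsets in the external system and
    // builds a DataFrame with the rows for the range (start, end].
    class MyStreamSource(sqlContext: SQLContext) extends Source {
      override def schema: StructType = MyStreamSource.schema
      override def getOffset: Option[Offset] = Some(LongOffset(0L))
      override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???
      override def stop(): Unit = ()
    }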

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all, Any help? PFB. Thanks, Aakash. On 20-Nov-2017 6:58 PM, "Aakash Basu" wrote: > Hi all, > > I have a table which will have 4 columns - > > | Expression | filter_condition | from_clause | group_by_columns | > > This file may have variable

Re: Writing custom Structured Streaming receiver

2017-11-20 Thread Hien Luu
Hi TD, I looked at the DataStreamReader class, and it looks like we can specify an FQCN as a source (provided that it implements the trait Source). The DataSource.lookupDataSource function will try to load this FQCN during the creation of a DataSource object instance inside DataStreamReader.load(). Will
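
If the lookup works as described, usage would presumably look like the sketch below; note that the class passed to format() is the provider class (the StreamSourceProvider/DataSourceRegister implementation), and the class and option names here are hypothetical.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("custom-source-demo").getOrCreate()

    // Either the DataSourceRegister short name or the provider's FQCN can be
    // passed to format(); "com.example.MyStreamProvider" is a placeholder.
    val df = spark.readStream
      .format("com.example.MyStreamProvider")
      .option("someOption", "someValue")   // hypothetical option
      .load()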

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Shixiong(Ryan) Zhu
You are using the Spark Streaming Kafka package. The correct package name is "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0". On Mon, Nov 20, 2017 at 4:15 PM, salemi wrote: > Yes, we are using --packages > > $SPARK_HOME/bin/spark-submit --packages >
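
Spelled out, the quoted command with the corrected coordinate would presumably become the following (the rest of the original command was truncated, hence the trailing ellipsis):

    $SPARK_HOME/bin/spark-submit \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
      --py-files shell.py ...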

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread salemi
Yes, we are using --packages $SPARK_HOME/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 --py-files shell.py

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Holden Karau
What command did you use to launch your Spark application? The https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying documentation suggests using spark-submit with the `--packages` flag to include the required Kafka package. e.g. ./bin/spark-submit --packages

PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread salemi
Hi All, we are trying to use the DataFrames approach with Kafka 0.10 and PySpark 2.2.0. We followed the instructions at https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html. We coded something similar to the code below in Python: df = spark \ .read \
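
Not from the original post, but for reference, a minimal Scala sketch of the documented batch-read pattern against the spark-sql-kafka-0-10 source (the PySpark calls mirror it); the broker address and topic name are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-df-demo").getOrCreate()

    // Batch read from Kafka, as in the structured-streaming-kafka-integration guide.
    // "host1:9092" and "mytopic" are placeholders.
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "mytopic")
      .load()

    // key/value arrive as binary; cast to strings before further processing.
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()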

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread lucas.g...@gmail.com
That sounds like a lot of work, and if I understand you correctly, it sounds like a piece of middleware that already exists (I could be wrong). Alluxio? Good luck and let us know how it goes! Gary On 20 November 2017 at 14:10, Jim Carroll wrote: > Thanks. In the meantime

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread Jim Carroll
Thanks. In the meantime I might just write a custom file system that maps writes of Parquet file parts to their final locations and then skips the move.

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread lucas.g...@gmail.com
You can expect to see some fixes for this sort of issue in the medium-term future (multiple months, probably not years). As Tayler notes, it's a Hadoop problem, not a Spark problem, so whichever version of Hadoop includes the fix will then have to wait for a Spark release to be built against it. Last

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread Tayler Lawrence Jones
It is an open issue with the Hadoop file committer, not Spark. The simple workaround is to write to HDFS and then copy to S3. Netflix gave a talk about their custom output committer at the last Spark Summit, which is a clever, efficient way of doing that; I'd check it out on YouTube. They have open-sourced
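
A rough sketch of that workaround, assuming an s3a filesystem is already configured; all paths and the bucket name are placeholders. For large outputs the copy step is usually done with distcp or s3-dist-cp rather than a single-process FileUtil.copy.

    import org.apache.hadoop.fs.{FileUtil, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hdfs-then-s3").getOrCreate()

    // 1. Write Parquet to HDFS, where the commit-by-rename step is cheap.
    val hdfsOut = "hdfs:///tmp/my_output"                 // placeholder path
    spark.range(100).toDF("id").write.mode("overwrite").parquet(hdfsOut)

    // 2. Copy the committed output to S3 in a single pass.
    val conf = spark.sparkContext.hadoopConfiguration
    val srcPath = new Path(hdfsOut)
    val dstPath = new Path("s3a://my-bucket/my_output")   // placeholder bucket
    val srcFs = srcPath.getFileSystem(conf)
    val dstFs = dstPath.getFileSystem(conf)
    FileUtil.copy(srcFs, srcPath, dstFs, dstPath, /* deleteSource = */ false, conf)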

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread Jim Carroll
I have this exact issue. I was going to intercept the call in the filesystem if I had to (since we're using the S3 filesystem from Presto anyway), but if there's simply a way to do this correctly I'd much prefer it. This basically doubles the time to write Parquet files to S3.

Re: How to print plan of Structured Streaming DataFrame

2017-11-20 Thread Shixiong(Ryan) Zhu
-dev +user Which Spark version are you using? There is a bug in older Spark versions; try the latest version. In addition, you can call `query.explain()` as well. On Mon, Nov 20, 2017 at 4:00 AM, Chang Chen wrote: > Hi Guys > > I modified StructuredNetworkWordCount to
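
For reference, explain() on a running StreamingQuery prints the physical plan of the most recently executed micro-batch (or a note that no batch has run yet). A minimal sketch; the socket host and port are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("explain-demo").getOrCreate()

    // A toy streaming query similar to StructuredNetworkWordCount.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")   // placeholder
      .option("port", 9999)          // placeholder
      .load()

    val query = lines.writeStream
      .format("console")
      .start()

    // Prints the plan of the last micro-batch; explain(true) also includes
    // the parsed, analyzed and optimized logical plans.
    query.explain()
    query.explain(true)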

Re: Parquet files from spark not readable in Cascading

2017-11-20 Thread Vikas Gandham
I tried setting spark.sql.parquet.writeLegacyFormat to true, but the issue still persists. Thanks Vikas Gandham On Thu, Nov 16, 2017 at 10:25 AM, Yong Zhang wrote: > I don't have experience with Cascading, but we saw a similar issue when > importing data generated in Spark into Hive.
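
One thing worth noting for anyone trying the same fix: writeLegacyFormat is a write-time option, so it only affects data written after it is set; files already written with the default layout are unchanged. A minimal sketch with a placeholder output path:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("legacy-parquet").getOrCreate()

    // Must be in effect before the write; it changes how Spark encodes
    // decimals and nested types in the Parquet files it produces.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

    spark.range(10)
      .selectExpr("id", "CAST(id AS DECIMAL(10,2)) AS amount")
      .write
      .mode("overwrite")
      .parquet("/tmp/legacy_parquet_output")   // placeholder path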

Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all, I have a table which will have 4 columns - | Expression | filter_condition | from_clause | group_by_columns | This file may have a variable number of rows depending on the number of KPIs I need to calculate. I need to write a Spark SQL program which will have to read this
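
Not part of the original question, but one way such a driver table is often consumed, sketched with hypothetical paths and the assumption that the referenced source tables are already registered as views: read the configuration rows, assemble one SQL string per row, and run each through spark.sql.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dynamic-kpis").getOrCreate()

    // Hypothetical configuration file with the four columns from the question.
    val config = spark.read
      .option("header", "true")
      .csv("/path/to/kpi_config.csv")   // placeholder path

    // Build and run one query per configuration row; collect() is acceptable
    // here because the config table is small.
    val results = config.collect().map { row =>
      val sql = s"""SELECT ${row.getAs[String]("group_by_columns")},
                   |       ${row.getAs[String]("Expression")}
                   |FROM ${row.getAs[String]("from_clause")}
                   |WHERE ${row.getAs[String]("filter_condition")}
                   |GROUP BY ${row.getAs[String]("group_by_columns")}""".stripMargin
      spark.sql(sql)
    }
    results.foreach(_.show())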

Re: Kryo not registered class

2017-11-20 Thread Vadim Semenov
Try: Class.forName("[Lorg.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation;") On Sun, Nov 19, 2017 at 3:24 PM, Angel Francisco Orta <angel.francisco.o...@gmail.com> wrote: > Hello, I'm on Spark 2.1.0 with Scala and I'm
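
For context, the "[L...;" form is the JVM name for an array of that class, so Class.forName is used to resolve it at runtime and hand it to registerKryoClasses. A hedged sketch of how that is typically wired up (the class is internal to Spark and its location may differ between versions):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Resolve the array class by its JVM name; classOf cannot express this directly.
    val arrayClass = Class.forName(
      "[Lorg.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation;")

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // registrationRequired is typically what surfaces the "not registered" error.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(arrayClass))

    val spark = SparkSession.builder().config(conf).appName("kryo-demo").getOrCreate()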