Why does my project have this kind of error?

2017-06-19 Thread 张明磊
Hello to all,


Below is my issue. I have already rebuilt and reimported my project in
IntelliJ IDEA, but it still gives me this kind of error. I can build without
error using Maven; only IDEA gives me this error. Does anyone know what is
happening here?




Thanks
Minglei

 

Re: How to save streaming aggregations on 'Structured Streams' in parquet format?

2017-06-19 Thread kaniska Mandal
Thanks Tathagata for the pointer.

On Mon, Jun 19, 2017 at 8:24 PM, Tathagata Das 
wrote:

> That is not the right way to use watermark + append output mode. The
> `withWatermark` must be before the aggregation. Something like this.
>
> df.withWatermark("timestamp", "1 hour")
>   .groupBy(window("timestamp", "30 seconds"))
>   .agg(...)
>
> Read more here - https://databricks.com/blog/2017/05/08/event-time-
> aggregation-watermarking-apache-sparks-structured-streaming.html
>
>
> On Mon, Jun 19, 2017 at 7:27 PM, kaniska Mandal 
> wrote:
>
>> Hi Burak,
>>
>> Per your suggestion, I have specified
>> > deviceBasicAgg.withWatermark("eventtime", "30 seconds");
>> before invoking deviceBasicAgg.writeStream()...
>>
>> But I am still facing ~
>>
>> org.apache.spark.sql.AnalysisException: Append output mode not supported
>> when there are streaming aggregations on streaming DataFrames/DataSets;
>>
>> I am Ok with 'complete' output mode.
>>
>> =
>>
>> I tried another approach - creating a parquet file from the in-memory
>> dataset ~ which seems to work.
>>
>> But I need only the delta, not the cumulative count. Since 'append' mode
>> does not support streaming aggregation, I have to use 'complete'
>> outputMode.
>>
>> StreamingQuery streamingQry = deviceBasicAgg.writeStream()
>>
>>   .format("memory")
>>
>>   .trigger(ProcessingTime.create("5 seconds"))
>>
>>   .queryName("deviceBasicAggSummary")
>>
>>   .outputMode("complete")
>>
>>   .option("checkpointLocation", "/tmp/parquet/checkpoints/")
>>
>>   .start();
>>
>> streamingQry.awaitTermination();
>>
>> Thread.sleep(5000);
>>
>> while(true) {
>>
>> Dataset deviceBasicAggSummaryData = spark.table("deviceBasicAggSummary");
>>
>> deviceBasicAggSummaryData.toDF().write().parquet("/data/summary/devices/" + new Date().getTime() + "/");
>>
>> }
>>
>> =
>>
>> So what's the best practice for 'low latency query on distributed data'
>> using Spark SQL and Structured Streaming?
>>
>>
>> Thanks
>>
>> Kaniska
>>
>>
>>
>> On Mon, Jun 19, 2017 at 11:55 AM, Burak Yavuz  wrote:
>>
>>> Hi Kaniska,
>>>
>>> In order to use append mode with aggregations, you need to set an event
>>> time watermark (using `withWatermark`). Otherwise, Spark doesn't know when
>>> to output an aggregation result as "final".
>>>
>>> Best,
>>> Burak
>>>
>>> On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal <
>>> kaniska.man...@gmail.com> wrote:
>>>
 Hi,

 My goal is to ~
 (1) either chain streaming aggregations in a single query OR
 (2) run multiple streaming aggregations and save data in some
 meaningful format to execute low latency / failsafe OLAP queries

 So my first choice is the parquet format, but I failed to make it work!

 I am using spark-streaming_2.11-2.1.1

 I am facing the following error -
 org.apache.spark.sql.AnalysisException: Append output mode not
 supported when there are streaming aggregations on streaming
 DataFrames/DataSets;

 - for the following syntax

  StreamingQuery streamingQry = tagBasicAgg.writeStream()

   .format("parquet")

   .trigger(ProcessingTime.create("10 seconds"))

   .queryName("tagAggSummary")

   .outputMode("append")

   .option("checkpointLocation", "/tmp/summary/checkpoints/"
 )

   .option("path", "/data/summary/tags/")

   .start();
 But, parquet doesn't support 'complete' outputMode.

 So is parquet supported only for batch queries, NOT for streaming queries?

 - note that console output mode is working fine!

 Any help will be much appreciated.

 Thanks
 Kaniska


>>>
>>
>


Do we have anything for Deep Learning in Spark?

2017-06-19 Thread Gaurav1809
Hi All,

Similar to how we have a machine learning library called ML, do we have
anything for deep learning?
If yes, please share the details. If not, then what should be the approach?

Thanks and regards,
Gaurav Pandya






Re: How to save streaming aggregations on 'Structured Streams' in parquet format?

2017-06-19 Thread Felix Cheung
And perhaps the error message can be improved here?


From: Tathagata Das 
Sent: Monday, June 19, 2017 8:24:01 PM
To: kaniska Mandal
Cc: Burak Yavuz; user
Subject: Re: How to save streaming aggregations on 'Structured Streams' in parquet format?

That is not the right way to use watermark + append output mode. The
`withWatermark` must be before the aggregation. Something like this.

df.withWatermark("timestamp", "1 hour")
  .groupBy(window("timestamp", "30 seconds"))
  .agg(...)

Read more here - 
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html


On Mon, Jun 19, 2017 at 7:27 PM, kaniska Mandal 
> wrote:
Hi Burak,

Per your suggestion, I have specified
> deviceBasicAgg.withWatermark("eventtime", "30 seconds");
before invoking deviceBasicAgg.writeStream()...

But I am still facing ~

org.apache.spark.sql.AnalysisException: Append output mode not supported when 
there are streaming aggregations on streaming DataFrames/DataSets;

I am Ok with 'complete' output mode.

=

I tried another approach - creating a parquet file from the in-memory dataset
~ which seems to work.

But I need only the delta, not the cumulative count. Since 'append' mode does
not support streaming aggregation, I have to use 'complete' outputMode.

StreamingQuery streamingQry = deviceBasicAgg.writeStream()

  .format("memory")

  .trigger(ProcessingTime.create("5 seconds"))

  .queryName("deviceBasicAggSummary")

  .outputMode("complete")

  .option("checkpointLocation", "/tmp/parquet/checkpoints/")

  .start();

streamingQry.awaitTermination();

Thread.sleep(5000);

while(true) {

Dataset deviceBasicAggSummaryData = spark.table("deviceBasicAggSummary");

deviceBasicAggSummaryData.toDF().write().parquet("/data/summary/devices/" + new Date().getTime() + "/");

}

=

So what's the best practice for 'low latency query on distributed data' using
Spark SQL and Structured Streaming?


Thanks

Kaniska


On Mon, Jun 19, 2017 at 11:55 AM, Burak Yavuz 
> wrote:
Hi Kaniska,

In order to use append mode with aggregations, you need to set an event time 
watermark (using `withWatermark`). Otherwise, Spark doesn't know when to output 
an aggregation result as "final".

Best,
Burak

On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal 
> wrote:
Hi,

My goal is to ~
(1) either chain streaming aggregations in a single query OR
(2) run multiple streaming aggregations and save data in some meaningful format 
to execute low latency / failsafe OLAP queries

So my first choice is the parquet format, but I failed to make it work!

I am using spark-streaming_2.11-2.1.1

I am facing the following error -
org.apache.spark.sql.AnalysisException: Append output mode not supported when 
there are streaming aggregations on streaming DataFrames/DataSets;

- for the following syntax

 StreamingQuery streamingQry = tagBasicAgg.writeStream()

  .format("parquet")

  .trigger(ProcessingTime.create("10 seconds"))

  .queryName("tagAggSummary")

  .outputMode("append")

  .option("checkpointLocation", "/tmp/summary/checkpoints/")

  .option("path", "/data/summary/tags/")

  .start();

But, parquet doesn't support 'complete' outputMode.

So is parquet supported only for batch queries, NOT for streaming queries?

- note that console output mode is working fine!

Any help will be much appreciated.

Thanks
Kaniska






Spark Streaming - Increasing number of executors slows down processing rate

2017-06-19 Thread Mal Edwin
Hi All,
I am struggling with an odd issue and would like your help in addressing it.

Environment
AWS Cluster (40 Spark Nodes & 4 node Kafka cluster)
Spark Kafka Streaming submitted in Yarn cluster mode
Kafka - Single topic, 400 partitions
Spark 2.1 on Cloudera
Kafka 10.0 on Cloudera

We have zero messages in Kafka and are starting this Spark job with 100
executors, each with 14GB of RAM and a single executor core.
The time to process 0 records (end of each batch) is 5 seconds.

When we increase the executors to 400 and everything else remains the same,
except that we reduce memory to 11GB, we see the time to process 0 records
(end of each batch) increase 10 times to 50 seconds, and in some cases it
goes to 103 seconds.

Spark Streaming configs that we are setting are
Batchwindow = 60 seconds
Backpressure.enabled = true
spark.memory.fraction=0.3 (we store more data in our own data structures)
spark.streaming.kafka.consumer.poll.ms=1

Have tried increasing driver memory to 4GB and also increased driver.cores to 4.

If anybody has faced similar issues please provide some pointers to how to 
address this issue.

Thanks a lot for your time.

Regards,
Edwin



Re: How to save streaming aggregations on 'Structured Streams' in parquet format?

2017-06-19 Thread Tathagata Das
That is not the right way to use watermark + append output mode. The
`withWatermark` must be before the aggregation. Something like this.

df.withWatermark("timestamp", "1 hour")
  .groupBy(window("timestamp", "30 seconds"))
  .agg(...)

Read more here -
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
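
For reference, a minimal Scala sketch of this ordering applied to the parquet sink discussed
in this thread; `deviceEvents` (a streaming DataFrame with an "eventtime" timestamp column)
and the paths are illustrative assumptions, not code from the thread:

import org.apache.spark.sql.functions.{col, window}

// The watermark is declared on the event-time column *before* the aggregation,
// so append mode can emit each window once it is considered final.
val agg = deviceEvents
  .withWatermark("eventtime", "30 seconds")
  .groupBy(window(col("eventtime"), "30 seconds"))
  .count()

val query = agg.writeStream
  .format("parquet")
  .option("path", "/data/summary/devices/")
  .option("checkpointLocation", "/tmp/parquet/checkpoints/")
  .outputMode("append")   // accepted once a watermark is set
  .start()

With this ordering each window is written only after the watermark passes its end, so the
parquet output contains per-window deltas rather than a cumulative count.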


On Mon, Jun 19, 2017 at 7:27 PM, kaniska Mandal 
wrote:

> Hi Burak,
>
> Per your suggestion, I have specified
> > deviceBasicAgg.withWatermark("eventtime", "30 seconds");
> before invoking deviceBasicAgg.writeStream()...
>
> But I am still facing ~
>
> org.apache.spark.sql.AnalysisException: Append output mode not supported
> when there are streaming aggregations on streaming DataFrames/DataSets;
>
> I am Ok with 'complete' output mode.
>
> =
>
> I tried another approach - creating a parquet file from the in-memory
> dataset ~ which seems to work.
>
> But I need only the delta, not the cumulative count. Since 'append' mode
> does not support streaming aggregation, I have to use 'complete' outputMode.
>
> StreamingQuery streamingQry = deviceBasicAgg.writeStream()
>
>   .format("memory")
>
>   .trigger(ProcessingTime.create("5 seconds"))
>
>   .queryName("deviceBasicAggSummary")
>
>   .outputMode("complete")
>
>   .option("checkpointLocation", "/tmp/parquet/checkpoints/")
>
>   .start();
>
> streamingQry.awaitTermination();
>
> Thread.sleep(5000);
>
> while(true) {
>
> Dataset deviceBasicAggSummaryData = spark.table("deviceBasicAggSummary");
>
> deviceBasicAggSummaryData.toDF().write().parquet("/data/summary/devices/" + new Date().getTime() + "/");
>
> }
>
> =
>
> So what's the best practice for 'low latency query on distributed data'
> using Spark SQL and Structured Streaming?
>
>
> Thanks
>
> Kaniska
>
>
>
> On Mon, Jun 19, 2017 at 11:55 AM, Burak Yavuz  wrote:
>
>> Hi Kaniska,
>>
>> In order to use append mode with aggregations, you need to set an event
>> time watermark (using `withWatermark`). Otherwise, Spark doesn't know when
>> to output an aggregation result as "final".
>>
>> Best,
>> Burak
>>
>> On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal <
>> kaniska.man...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My goal is to ~
>>> (1) either chain streaming aggregations in a single query OR
>>> (2) run multiple streaming aggregations and save data in some meaningful
>>> format to execute low latency / failsafe OLAP queries
>>>
>>> So my first choice is the parquet format, but I failed to make it work!
>>>
>>> I am using spark-streaming_2.11-2.1.1
>>>
>>> I am facing the following error -
>>> org.apache.spark.sql.AnalysisException: Append output mode not
>>> supported when there are streaming aggregations on streaming
>>> DataFrames/DataSets;
>>>
>>> - for the following syntax
>>>
>>>  StreamingQuery streamingQry = tagBasicAgg.writeStream()
>>>
>>>   .format("parquet")
>>>
>>>   .trigger(ProcessingTime.create("10 seconds"))
>>>
>>>   .queryName("tagAggSummary")
>>>
>>>   .outputMode("append")
>>>
>>>   .option("checkpointLocation", "/tmp/summary/checkpoints/")
>>>
>>>   .option("path", "/data/summary/tags/")
>>>
>>>   .start();
>>> But, parquet doesn't support 'complete' outputMode.
>>>
>>> So is parquet supported only for batch queries, NOT for streaming queries?
>>>
>>> - note that console output mode is working fine!
>>>
>>> Any help will be much appreciated.
>>>
>>> Thanks
>>> Kaniska
>>>
>>>
>>
>


Re: How to save streaming aggregations on 'Structured Streams' in parquet format?

2017-06-19 Thread kaniska Mandal
Hi Burak,

Per your suggestion, I have specified
> deviceBasicAgg.withWatermark("eventtime", "30 seconds");
before invoking deviceBasicAgg.writeStream()...

But I am still facing ~

org.apache.spark.sql.AnalysisException: Append output mode not supported
when there are streaming aggregations on streaming DataFrames/DataSets;

I am Ok with 'complete' output mode.

=

I tried another approach - creating a parquet file from the in-memory dataset
~ which seems to work.

But I need only the delta, not the cumulative count. Since 'append' mode does
not support streaming aggregation, I have to use 'complete' outputMode.

StreamingQuery streamingQry = deviceBasicAgg.writeStream()

  .format("memory")

  .trigger(ProcessingTime.create("5 seconds"))

  .queryName("deviceBasicAggSummary")

  .outputMode("complete")

  .option("checkpointLocation", "/tmp/parquet/checkpoints/")

  .start();

streamingQry.awaitTermination();

Thread.sleep(5000);

while(true) {

Dataset deviceBasicAggSummaryData = spark.table("deviceBasicAggSummary");

deviceBasicAggSummaryData.toDF().write().parquet("/data/summary/devices/" + new Date().getTime() + "/");

}

=

So what's the best practice for 'low latency query on distributed data'
using Spark SQL and Structured Streaming?


Thanks

Kaniska



On Mon, Jun 19, 2017 at 11:55 AM, Burak Yavuz  wrote:

> Hi Kaniska,
>
> In order to use append mode with aggregations, you need to set an event
> time watermark (using `withWatermark`). Otherwise, Spark doesn't know when
> to output an aggregation result as "final".
>
> Best,
> Burak
>
> On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal  > wrote:
>
>> Hi,
>>
>> My goal is to ~
>> (1) either chain streaming aggregations in a single query OR
>> (2) run multiple streaming aggregations and save data in some meaningful
>> format to execute low latency / failsafe OLAP queries
>>
>> So my first choice is the parquet format, but I failed to make it work!
>>
>> I am using spark-streaming_2.11-2.1.1
>>
>> I am facing the following error -
>> org.apache.spark.sql.AnalysisException: Append output mode not supported
>> when there are streaming aggregations on streaming DataFrames/DataSets;
>>
>> - for the following syntax
>>
>>  StreamingQuery streamingQry = tagBasicAgg.writeStream()
>>
>>   .format("parquet")
>>
>>   .trigger(ProcessingTime.create("10 seconds"))
>>
>>   .queryName("tagAggSummary")
>>
>>   .outputMode("append")
>>
>>   .option("checkpointLocation", "/tmp/summary/checkpoints/")
>>
>>   .option("path", "/data/summary/tags/")
>>
>>   .start();
>> But, parquet doesn't support 'complete' outputMode.
>>
>> So is parquet supported only for batch queries, NOT for streaming queries?
>>
>> - note that console output mode is working fine!
>>
>> Any help will be much appreciated.
>>
>> Thanks
>> Kaniska
>>
>>
>


Unsubscribe

2017-06-19 Thread praba karan
Unsubscribe


the meaning of partition column and bucket column please?

2017-06-19 Thread 萝卜丝炒饭
Hi all,
The Column class has members named isPartition and isBucket.


What is the meaning of these, please?
And when should they be set to true?


Thank you in advance.


Fei Shao

Re: the scheme in stream reader

2017-06-19 Thread 萝卜丝炒饭
Hi ,
I have submitted a JIRA for this issue.
The link is 
https://issues.apache.org/jira/browse/SPARK-21147

thanks 
Fei Shao
 
---Original---
From: "Michael Armbrust"
Date: 2017/6/20 03:06:49
To: "??"<1427357...@qq.com>;
Cc: "user";"dev";
Subject: Re: the scheme in stream reader


The socket source can't know how to parse your data.  I think the right thing 
would be for it to throw an exception saying that you can't set the schema 
here.  Would you mind opening a JIRA ticket?

If you are trying to parse data from something like JSON then you should use 
`from_json` on the value returned.


On Sun, Jun 18, 2017 at 12:27 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
Hi all,


I set the scheme for DataStreamReader, but when I print the scheme it just
printed:
root
|--value:string (nullable=true)


My code is


val line = ss.readStream.format("socket")
.option("ip",xxx)
.option("port",xxx)
.scheme(StructField("name",StringType)::(StructField("age", IntegerType))).load
line.printSchema


My spark version is 2.1.0.
I want printSchema to print the schema I set in the code. How should I do
this, please?
And my original goal is for the data received from the socket to be handled
with that schema directly. What should I do, please?


thanks
Fei Shao

Flume DStream produces 0 records after HDFS node killed

2017-06-19 Thread N B
Hi all,

We are running a Standalone Spark Cluster for running a streaming
application. The application consumes data from Flume using a Flume Polling
stream created as such :

flumeStream = FlumeUtils.createPollingStream(streamingContext,
socketAddress.toArray(new InetSocketAddress[socketAddress.size()]),
StorageLevel.MEMORY_AND_DISK_SER(), 100, 5);


The checkpoint directory is configured to be on an HDFS cluster and Spark
workers have their SPARK_LOCAL_DIRS and SPARK_WORKER_DIR defined to be on
their respective local filesystems.

What we are seeing is some odd behavior that we are unable to explain. During
normal operation, everything runs as expected, with Flume delivering events
to Spark. However, while running, if I kill one of the HDFS nodes (does not
matter which one), the Flume Receiver in Spark stops producing any data to
the data processing.

I enabled debug logging for org.apache.spark.streaming.flume on Spark
worker nodes and looked at the logs for the one that gets to run the Flume
Receiver and it keeps chugging along receiving data from Flume as shown in
a sample of the log below, but the resulting batches in the Stream start
receiving 0 records soon as the HDFS node is killed, with no errors being
produced to indicate any issue.

17/06/20 01:05:42 DEBUG FlumeBatchFetcher: Ack sent for sequence number: 09fa05f59050
17/06/20 01:05:44 DEBUG FlumeBatchFetcher: Received batch of 100 events with sequence number: 09fa05f59052
17/06/20 01:05:44 DEBUG FlumeBatchFetcher: Sending ack for sequence number: 09fa05f59052
17/06/20 01:05:44 DEBUG FlumeBatchFetcher: Ack sent for sequence number: 09fa05f59052
17/06/20 01:05:47 DEBUG FlumeBatchFetcher: Received batch of 100 events with sequence number: 09fa05f59054
17/06/20 01:05:47 DEBUG FlumeBatchFetcher: Sending ack for sequence number: 09fa05f59054

The driver output for the application shows (printed via
DStream.count().map().print()):

---
Time: 149792077 ms
---
Received 0 flume events.


Any insights about where to look in order to find the root cause will be
greatly appreciated.

Thanks
N B


Merging multiple Pandas dataframes

2017-06-19 Thread saatvikshah1994
Hi, 

I am iteratively receiving a file which can only be opened as a Pandas
dataframe. For the first such file I receive, I am converting this to a
Spark dataframe using the 'createDataframe' utility function. The next file
onward, I am converting it and union'ing it into the first Spark
dataframe (the schema always stays the same). After each union, I am
persisting it in memory (MEMORY_AND_DISK_ONLY level). After I have converted
all such files to a single Spark dataframe, I am coalescing it, following
some tips from this Stack Overflow post
(https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions).

Any suggestions for optimizing this process further?






Re: Error while doing mvn release for spark 2.0.2 using scala 2.10

2017-06-19 Thread Kanagha Kumar
Thanks. But I am required to do a Maven release to Nexus of Spark 2.0.2
built against Scala 2.10.
How can I go about this? Is this a bug that I need to open in the Spark
JIRA?

On Mon, Jun 19, 2017 at 12:12 PM, Shixiong(Ryan) Zhu <
shixi...@databricks.com> wrote:

> Some of the projects (such as spark-tags) are Java projects. Spark doesn't
> fix the artifact name and just hard-codes 2.11.
>
> For your issue, try to use `install` rather than `package`.
>
> On Sat, Jun 17, 2017 at 7:20 PM, Kanagha Kumar 
> wrote:
>
>> Hi,
>>
>> Bumping this up again! Why do the Spark modules depend upon Scala 2.11
>> versions in spite of changing the pom.xml files using
>> ./dev/change-scala-version.sh 2.10? Appreciate any quick help!
>>
>> Thanks
>>
>> On Fri, Jun 16, 2017 at 2:59 PM, Kanagha Kumar 
>> wrote:
>>
>>> Hey all,
>>>
>>>
>>> I'm trying to use Spark 2.0.2 with scala 2.10 by following this
>>> https://spark.apache.org/docs/2.0.2/building-spark.html
>>> #building-for-scala-210
>>>
>>> ./dev/change-scala-version.sh 2.10
>>> ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
>>>
>>>
>>> I could build the distribution successfully using
>>> bash -xv dev/make-distribution.sh --tgz  -Dscala-2.10 -DskipTests
>>>
>>> But, when I am trying to maven release, it keeps failing with the error
>>> using the command:
>>>
>>>
>>> Executing Maven:  -B -f pom.xml  -DscmCommentPrefix=[maven-release-plugin]
>>> -e  -Dscala-2.10 -Pyarn -Phadoop-2.7 -Phadoop-provided -DskipTests
>>> -Dresume=false -U -X *release:prepare release:perform*
>>>
>>> Failed to execute goal on project spark-sketch_2.10: Could not resolve
>>> dependencies for project 
>>> org.apache.spark:spark-sketch_2.10:jar:2.0.2-sfdc-3.0.0:
>>> *Failure to find org.apache.spark:spark-tags_2.11:jar:2.0.2-sfdc-3.0.0*
>>> in  was cached in the local repository, resolution will
>>> not be reattempted until the update interval of nexus has elapsed or
>>> updates are forced - [Help 1]
>>>
>>>
>>> Why does spark-sketch depend upon spark-tags_2.11 when I have already
>>> compiled against scala 2.10?? Any pointers would be helpful.
>>> Thanks
>>> Kanagha
>>>
>>
>>
>


Re: how many topics spark streaming can handle

2017-06-19 Thread Bryan Jeffrey
Hello Ashok, 




We're consuming from more than 10 topics in some Spark streaming applications. 
Topic management is a concern (what is read from where, etc), but I have seen 
no issues from Spark itself. 




Regards,

Bryan Jeffrey


On Mon, Jun 19, 2017 at 3:24 PM -0400, "Ashok Kumar" 
 wrote:

thank you
in the following example:

    val topics = "test1,test2,test3"
    val brokers = "localhost:9092"
    val topicsSet = topics.split(",").toSet
    val sparkConf = new SparkConf().setAppName("KafkaDroneCalc").setMaster("local") // spark://localhost:7077
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(30))
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

Is it possible to have three topics, or many topics?

 

On Monday, 19 June 2017, 20:10, Michael Armbrust  
wrote:
  

 I don't think that there is really a Spark specific limit here.  It would be a 
function of the size of your spark / kafka clusters and the type of processing 
you are trying to do.
On Mon, Jun 19, 2017 at 12:00 PM, Ashok Kumar  
wrote:
  Hi Gurus,
Within one Spark streaming process how many topics can be handled? I have not 
tried more than one topic.
Thanks


 






Re: how many topics spark streaming can handle

2017-06-19 Thread Ashok Kumar
thank you
in the following example:

    val topics = "test1,test2,test3"
    val brokers = "localhost:9092"
    val topicsSet = topics.split(",").toSet
    val sparkConf = new SparkConf().setAppName("KafkaDroneCalc").setMaster("local") // spark://localhost:7077
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(30))
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

Is it possible to have three topics, or many topics?

 

On Monday, 19 June 2017, 20:10, Michael Armbrust  
wrote:
 

 I don't think that there is really a Spark specific limit here.  It would be a 
function of the size of your spark / kafka clusters and the type of processing 
you are trying to do.
On Mon, Jun 19, 2017 at 12:00 PM, Ashok Kumar  
wrote:

 Hi Gurus,
Within one Spark streaming process how many topics can be handled? I have not 
tried more than one topic.
Thanks



   

Re: Error while doing mvn release for spark 2.0.2 using scala 2.10

2017-06-19 Thread Shixiong(Ryan) Zhu
Some of the projects (such as spark-tags) are Java projects. Spark doesn't fix
the artifact name and just hard-codes 2.11.

For your issue, try to use `install` rather than `package`.
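
For example, a sketch based on the build command quoted below, with only the final goal
changed:

./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean install

so that the locally built _2.10 artifacts (including spark-tags) are installed into the
local Maven repository and later goals resolve against them instead of looking for the
published _2.11 jars.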

On Sat, Jun 17, 2017 at 7:20 PM, Kanagha Kumar 
wrote:

> Hi,
>
> Bumping this up again! Why do the Spark modules depend upon Scala 2.11
> versions in spite of changing the pom.xml files using
> ./dev/change-scala-version.sh 2.10? Appreciate any quick help!
>
> Thanks
>
> On Fri, Jun 16, 2017 at 2:59 PM, Kanagha Kumar 
> wrote:
>
>> Hey all,
>>
>>
>> I'm trying to use Spark 2.0.2 with scala 2.10 by following this
>> https://spark.apache.org/docs/2.0.2/building-spark.html
>> #building-for-scala-210
>>
>> ./dev/change-scala-version.sh 2.10
>> ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
>>
>>
>> I could build the distribution successfully using
>> bash -xv dev/make-distribution.sh --tgz  -Dscala-2.10 -DskipTests
>>
>> But, when I am trying to maven release, it keeps failing with the error
>> using the command:
>>
>>
>> Executing Maven:  -B -f pom.xml  -DscmCommentPrefix=[maven-release-plugin]
>> -e  -Dscala-2.10 -Pyarn -Phadoop-2.7 -Phadoop-provided -DskipTests
>> -Dresume=false -U -X *release:prepare release:perform*
>>
>> Failed to execute goal on project spark-sketch_2.10: Could not resolve
>> dependencies for project 
>> org.apache.spark:spark-sketch_2.10:jar:2.0.2-sfdc-3.0.0:
>> *Failure to find org.apache.spark:spark-tags_2.11:jar:2.0.2-sfdc-3.0.0*
>> in  was cached in the local repository, resolution will
>> not be reattempted until the update interval of nexus has elapsed or
>> updates are forced - [Help 1]
>>
>>
>> Why does spark-sketch depend upon spark-tags_2.11 when I have already
>> compiled against scala 2.10?? Any pointers would be helpful.
>> Thanks
>> Kanagha
>>
>
>


Re: how many topics spark streaming can handle

2017-06-19 Thread Michael Armbrust
I don't think that there is really a Spark specific limit here.  It would
be a function of the size of your spark / kafka clusters and the type of
processing you are trying to do.

On Mon, Jun 19, 2017 at 12:00 PM, Ashok Kumar 
wrote:

> Hi Gurus,
>
> Within one Spark streaming process how many topics can be handled? I have
> not tried more than one topic.
>
> Thanks
>


Re: the scheme in stream reader

2017-06-19 Thread Michael Armbrust
The socket source can't know how to parse your data.  I think the right
thing would be for it to throw an exception saying that you can't set the
schema here.  Would you mind opening a JIRA ticket?

If you are trying to parse data from something like JSON then you should
use `from_json` on the value returned.
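
For illustration, a minimal Scala sketch of that suggestion against the socket source; the
name/age fields come from the question below, while the host, port, and the SparkSession
named `spark` are assumed placeholders:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The socket source always produces a single string column named "value".
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The schema is applied afterwards by parsing that string column.
val schema = StructType(StructField("name", StringType) :: StructField("age", IntegerType) :: Nil)
val parsed = lines
  .select(from_json(col("value"), schema).as("data"))
  .select("data.*")

parsed.printSchema()   // now shows name: string and age: integer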

On Sun, Jun 18, 2017 at 12:27 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> I set the scheme for DataStreamReader, but when I print the scheme it just
> printed:
> root
> |--value:string (nullable=true)
>
> My code is
>
> val line = ss.readStream.format("socket")
> .option("ip",xxx)
> .option("port",xxx)
> .scheme(StructField("name",StringType)::(StructField("age",
> IntegerType))).load
> line.printSchema
>
> My spark version is 2.1.0.
> I want printSchema to print the schema I set in the code. How should I do
> this, please?
> And my original goal is for the data received from the socket to be handled
> with that schema directly. What should I do, please?
>
> thanks
> Fei Shao
>
>
>
>
>
>
>


how many topics spark streaming can handle

2017-06-19 Thread Ashok Kumar
 Hi Gurus,
Within one Spark streaming process how many topics can be handled? I have not 
tried more than one topic.
Thanks

Re: How to save streaming aggregations on 'Structured Streams' in parquet format?

2017-06-19 Thread Burak Yavuz
Hi Kaniska,

In order to use append mode with aggregations, you need to set an event
time watermark (using `withWatermark`). Otherwise, Spark doesn't know when
to output an aggregation result as "final".

Best,
Burak

On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal 
wrote:

> Hi,
>
> My goal is to ~
> (1) either chain streaming aggregations in a single query OR
> (2) run multiple streaming aggregations and save data in some meaningful
> format to execute low latency / failsafe OLAP queries
>
> So my first choice is the parquet format, but I failed to make it work!
>
> I am using spark-streaming_2.11-2.1.1
>
> I am facing the following error -
> org.apache.spark.sql.AnalysisException: Append output mode not supported
> when there are streaming aggregations on streaming DataFrames/DataSets;
>
> - for the following syntax
>
>  StreamingQuery streamingQry = tagBasicAgg.writeStream()
>
>   .format("parquet")
>
>   .trigger(ProcessingTime.create("10 seconds"))
>
>   .queryName("tagAggSummary")
>
>   .outputMode("append")
>
>   .option("checkpointLocation", "/tmp/summary/checkpoints/")
>
>   .option("path", "/data/summary/tags/")
>
>   .start();
> But, parquet doesn't support 'complete' outputMode.
>
> So is parquet supported only for batch queries, NOT for streaming queries?
>
> - note that console output mode is working fine!
>
> Any help will be much appreciated.
>
> Thanks
> Kaniska
>
>


Re: Spark-Kafka integration - build failing with sbt

2017-06-19 Thread Cody Koeninger
org.apache.spark.streaming.kafka.KafkaUtils is in the spark-streaming-kafka-0-8 project.
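
For reference, a sketch of the matching sbt dependency line (the 2.1.0 version follows the
build in this thread, and %% appends the Scala suffix):

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0"

The unsuffixed "spark-streaming-kafka" artifact used in the original build file appears to
have been published only for the Spark 1.x line, which is why resolving it at 2.1.0 fails.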

On Mon, Jun 19, 2017 at 1:01 PM, karan alang  wrote:
> Hi Cody - I do have an additional basic question.
>
> When I tried to compile the code in Eclipse, I was not able to do that.
>
> e.g.
> import org.apache.spark.streaming.kafka.KafkaUtils
>
> gave errors saying KafkaUtils was not part of the package.
> However, when I used sbt to compile, the compilation went through fine.
>
> So, I assume additional libraries are being downloaded when I provide the
> appropriate packages in libraryDependencies?
> Which ones would have helped compile this?
>
>
>
> On Sat, Jun 17, 2017 at 2:53 PM, karan alang  wrote:
>>
>> Thanks, Cody .. yes, was able to fix that.
>>
>> On Sat, Jun 17, 2017 at 1:18 PM, Cody Koeninger 
>> wrote:
>>>
>>> There are different projects for different versions of kafka,
>>> spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10
>>>
>>> See
>>>
>>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>>>
>>> On Fri, Jun 16, 2017 at 6:51 PM, karan alang 
>>> wrote:
>>> > I'm trying to compile kafka & Spark Streaming integration code i.e.
>>> > reading
>>> > from Kafka using Spark Streaming,
>>> >   and the sbt build is failing with error -
>>> >
>>> >   [error] (*:update) sbt.ResolveException: unresolved dependency:
>>> > org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
>>> >
>>> >   Scala version -> 2.10.7
>>> >   Spark Version -> 2.1.0
>>> >   Kafka version -> 0.9
>>> >   sbt version -> 0.13
>>> >
>>> > Contents of sbt files is as shown below ->
>>> >
>>> > 1)
>>> >   vi spark_kafka_code/project/plugins.sbt
>>> >
>>> >   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
>>> >
>>> >  2)
>>> >   vi spark_kafka_code/sparkkafka.sbt
>>> >
>>> > import AssemblyKeys._
>>> > assemblySettings
>>> >
>>> > name := "SparkKafka Project"
>>> >
>>> > version := "1.0"
>>> > scalaVersion := "2.11.7"
>>> >
>>> > val sparkVers = "2.1.0"
>>> >
>>> > // Base Spark-provided dependencies
>>> > libraryDependencies ++= Seq(
>>> >   "org.apache.spark" %% "spark-core" % sparkVers % "provided",
>>> >   "org.apache.spark" %% "spark-streaming" % sparkVers % "provided",
>>> >   "org.apache.spark" %% "spark-streaming-kafka" % sparkVers)
>>> >
>>> > mergeStrategy in assembly := {
>>> >   case m if m.toLowerCase.endsWith("manifest.mf") =>
>>> > MergeStrategy.discard
>>> >   case m if m.toLowerCase.startsWith("META-INF")  =>
>>> > MergeStrategy.discard
>>> >   case "reference.conf"   =>
>>> > MergeStrategy.concat
>>> >   case m if m.endsWith("UnusedStubClass.class")   =>
>>> > MergeStrategy.discard
>>> >   case _ => MergeStrategy.first
>>> > }
>>> >
>>> >   i launch sbt, and then try to create an eclipse project, complete
>>> > error is
>>> > as shown below -
>>> >
>>> >   -
>>> >
>>> >   sbt
>>> > [info] Loading global plugins from /Users/karanalang/.sbt/0.13/plugins
>>> > [info] Loading project definition from
>>> >
>>> > /Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/project
>>> > [info] Set current project to SparkKafka Project (in build
>>> >
>>> > file:/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/)
>>> >> eclipse
>>> > [info] About to create Eclipse project files for your project(s).
>>> > [info] Updating
>>> >
>>> > {file:/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/}spark_kafka_code...
>>> > [info] Resolving org.apache.spark#spark-streaming-kafka_2.11;2.1.0 ...
>>> > [warn] module not found:
>>> > org.apache.spark#spark-streaming-kafka_2.11;2.1.0
>>> > [warn]  local: tried
>>> > [warn]
>>> >
>>> > /Users/karanalang/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
>>> > [warn]  activator-launcher-local: tried
>>> > [warn]
>>> >
>>> > /Users/karanalang/.activator/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
>>> > [warn]  activator-local: tried
>>> > [warn]
>>> >
>>> > /Users/karanalang/Documents/Technology/SCALA/activator-dist-1.3.10/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
>>> > [warn]  public: tried
>>> > [warn]
>>> >
>>> > https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
>>> > [warn]  typesafe-releases: tried
>>> > [warn]
>>> >
>>> > http://repo.typesafe.com/typesafe/releases/org/apache/spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
>>> > [warn]  typesafe-ivy-releasez: tried
>>> > [warn]
>>> >
>>> > http://repo.typesafe.com/typesafe/ivy-releases/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
>>> > [info] Resolving jline#jline;2.12.1 ...
>>> > [warn] ::
>>> > [warn] ::  UNRESOLVED 

How to save streaming aggregations on 'Structured Streams' in parquet format?

2017-06-19 Thread kaniska Mandal
Hi,

My goal is to ~
(1) either chain streaming aggregations in a single query OR
(2) run multiple streaming aggregations and save data in some meaningful
format to execute low latency / failsafe OLAP queries

So my first choice is the parquet format, but I failed to make it work!

I am using spark-streaming_2.11-2.1.1

I am facing the following error -
org.apache.spark.sql.AnalysisException: Append output mode not supported
when there are streaming aggregations on streaming DataFrames/DataSets;

- for the following syntax

 StreamingQuery streamingQry = tagBasicAgg.writeStream()

  .format("parquet")

  .trigger(ProcessingTime.create("10 seconds"))

  .queryName("tagAggSummary")

  .outputMode("append")

  .option("checkpointLocation", "/tmp/summary/checkpoints/")

  .option("path", "/data/summary/tags/")

  .start();
But, parquet doesn't support 'complete' outputMode.

So is parquet supported only for batch queries, NOT for streaming queries?

- note that console output mode is working fine!

Any help will be much appreciated.

Thanks
Kaniska


Re: Spark-Kafka integration - build failing with sbt

2017-06-19 Thread karan alang
Hi Cody - I do have an additional basic question.

When I tried to compile the code in Eclipse, I was not able to do that.

e.g.
import org.apache.spark.streaming.kafka.KafkaUtils

gave errors saying KafkaUtils was not part of the package.
However, when I used sbt to compile, the compilation went through fine.

So, I assume additional libraries are being downloaded when I provide the
appropriate packages in libraryDependencies?
Which ones would have helped compile this?



On Sat, Jun 17, 2017 at 2:53 PM, karan alang  wrote:

> Thanks, Cody .. yes, was able to fix that.
>
> On Sat, Jun 17, 2017 at 1:18 PM, Cody Koeninger 
> wrote:
>
>> There are different projects for different versions of kafka,
>> spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10
>>
>> See
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>>
>> On Fri, Jun 16, 2017 at 6:51 PM, karan alang 
>> wrote:
>> > I'm trying to compile kafka & Spark Streaming integration code i.e.
>> reading
>> > from Kafka using Spark Streaming,
>> >   and the sbt build is failing with error -
>> >
>> >   [error] (*:update) sbt.ResolveException: unresolved dependency:
>> > org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
>> >
>> >   Scala version -> 2.10.7
>> >   Spark Version -> 2.1.0
>> >   Kafka version -> 0.9
>> >   sbt version -> 0.13
>> >
>> > Contents of sbt files is as shown below ->
>> >
>> > 1)
>> >   vi spark_kafka_code/project/plugins.sbt
>> >
>> >   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
>> >
>> >  2)
>> >   vi spark_kafka_code/sparkkafka.sbt
>> >
>> > import AssemblyKeys._
>> > assemblySettings
>> >
>> > name := "SparkKafka Project"
>> >
>> > version := "1.0"
>> > scalaVersion := "2.11.7"
>> >
>> > val sparkVers = "2.1.0"
>> >
>> > // Base Spark-provided dependencies
>> > libraryDependencies ++= Seq(
>> >   "org.apache.spark" %% "spark-core" % sparkVers % "provided",
>> >   "org.apache.spark" %% "spark-streaming" % sparkVers % "provided",
>> >   "org.apache.spark" %% "spark-streaming-kafka" % sparkVers)
>> >
>> > mergeStrategy in assembly := {
>> >   case m if m.toLowerCase.endsWith("manifest.mf") =>
>> MergeStrategy.discard
>> >   case m if m.toLowerCase.startsWith("META-INF")  =>
>> MergeStrategy.discard
>> >   case "reference.conf"   =>
>> MergeStrategy.concat
>> >   case m if m.endsWith("UnusedStubClass.class")   =>
>> MergeStrategy.discard
>> >   case _ => MergeStrategy.first
>> > }
>> >
>> >   i launch sbt, and then try to create an eclipse project, complete
>> error is
>> > as shown below -
>> >
>> >   -
>> >
>> >   sbt
>> > [info] Loading global plugins from /Users/karanalang/.sbt/0.13/plugins
>> > [info] Loading project definition from
>> > /Users/karanalang/Documents/Technology/Coursera_spark_scala/
>> spark_kafka_code/project
>> > [info] Set current project to SparkKafka Project (in build
>> > file:/Users/karanalang/Documents/Technology/Coursera_spark_
>> scala/spark_kafka_code/)
>> >> eclipse
>> > [info] About to create Eclipse project files for your project(s).
>> > [info] Updating
>> > {file:/Users/karanalang/Documents/Technology/Coursera_spark_
>> scala/spark_kafka_code/}spark_kafka_code...
>> > [info] Resolving org.apache.spark#spark-streaming-kafka_2.11;2.1.0 ...
>> > [warn] module not found:
>> > org.apache.spark#spark-streaming-kafka_2.11;2.1.0
>> > [warn]  local: tried
>> > [warn]
>> > /Users/karanalang/.ivy2/local/org.apache.spark/spark-streami
>> ng-kafka_2.11/2.1.0/ivys/ivy.xml
>> > [warn]  activator-launcher-local: tried
>> > [warn]
>> > /Users/karanalang/.activator/repository/org.apache.spark/spa
>> rk-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
>> > [warn]  activator-local: tried
>> > [warn]
>> > /Users/karanalang/Documents/Technology/SCALA/activator-dist-
>> 1.3.10/repository/org.apache.spark/spark-streaming-kafka_2.
>> 11/2.1.0/ivys/ivy.xml
>> > [warn]  public: tried
>> > [warn]
>> > https://repo1.maven.org/maven2/org/apache/spark/spark-stream
>> ing-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
>> > [warn]  typesafe-releases: tried
>> > [warn]
>> > http://repo.typesafe.com/typesafe/releases/org/apache/spark/
>> spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
>> > [warn]  typesafe-ivy-releasez: tried
>> > [warn]
>> > http://repo.typesafe.com/typesafe/ivy-releases/org.apache.
>> spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
>> > [info] Resolving jline#jline;2.12.1 ...
>> > [warn] ::
>> > [warn] ::  UNRESOLVED DEPENDENCIES ::
>> > [warn] ::
>> > [warn] :: org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not
>> found
>> > [warn] ::
>> > [warn]
>> > [warn] Note: Unresolved dependencies path:
>> > [warn] 

spark submit with logs and kerberos

2017-06-19 Thread Juan Pablo Briganti
Hi!

I have a question about logs and have not seen the answer on the internet.
I have a spark submit process and I configure a custom log configuration to
it using the next params:
--conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=customlog4j.properties"

--driver-java-options '-Dlog4j.configuration=customlog4j.properties'
So far it is working great, but now I have to add Kerberos params like this:
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true
--keytab 
--principal 
When I add those params (and in particular, --principal), my custom log file
starts being ignored and the default Spark log file is used instead.
Changing the default Spark configuration file or path is a little difficult
for us, since we can't access that cluster.
Is there any way to keep using my custom log configuration the same way? Am I
doing something wrong?

Thanks for the help!





-- 
Juan Pablo Briganti | Data Architect
GLOBANT | AR: +54 11 4109 1700 ext. 19508 | US: +1 877 215 5230 ext. 19508



Stream Processing: how to refresh a loaded dataset periodically

2017-06-19 Thread aravias
Hi,
We are using Structured Streaming for stream processing, and for each
message, to do some data enrichment, I have to look up data from Cassandra;
that data in Cassandra gets updated periodically (about once a day).
I want to look at the option of loading it as a dataset, registering it as a
temp table, and querying the required data from that, to avoid making a call
to Cassandra for each message. But I am not sure how to refresh the loaded
data from Cassandra periodically, since it gets updated once a day. I am
looking for any suggestions here, thanks.

regards
Arvind






Spark streaming data loss

2017-06-19 Thread vasanth kumar
Hi,

I have spark kafka streaming job running in Yarn cluster mode with
spark.task.maxFailures=4 (default)
spark.yarn.max.executor.failures=8
number of executor=1
spark.streaming.stopGracefullyOnShutdown=false
checkpointing enabled


- When there is a RuntimeException in a batch on the executor, the same batch
is retried 4 times and then processing moves to the next batch. It moves
through many batches this way, and later the executor fails. The executor
receives the shutdown after a few seconds, and the driver and executor are
killed.
- Then the driver and executor relaunch with a much higher offset than the
last offset used by the failed executor.

I expected the executor to fail after a batch fails 4 times, and the new
executor to relaunch with the same failed batch.

The driver creates stages with a new batch range after the previous batch
fails 4 times. How do I stop it from creating new tasks in the executor? How
can I avoid such data loss?

Spark version: 1.6.1


-- 
Regards
Vasanth kumar RJ


Does spark support hive table(parquet) column renaming?

2017-06-19 Thread 李斌松
Does spark support hive table(parquet) column renaming?