Does Spark Structured Streaming have a JDBC sink, or do I need to use ForeachWriter?

2018-06-20 Thread kant kodali
Hi All,

Does Spark Structured Streaming have a JDBC sink, or do I need to use
ForeachWriter? I see the following code in this link, and I can see that the
database name can be passed in the connection string; however, I wonder how
to pass a table name?

inputDF.groupBy($"action", window($"time", "1 hour")).count()
   .writeStream.format("jdbc")
   .save("jdbc:mysql//…")


Thanks,
Kant


Re: Lag and queued up batches info in Structured Streaming UI

2018-06-20 Thread Tathagata Das
Also, you can get information about the last progress made (input rates,
etc.) from StreamingQuery.lastProgress, StreamingQuery.recentProgress, and
using StreamingQueryListener.
It's all documented here:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
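
For illustration, a minimal sketch of those hooks (assuming an existing
SparkSession named spark and a started StreamingQuery named query):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Poll the progress of a running query.
println(query.lastProgress)            // last completed micro-batch, as JSON
query.recentProgress.foreach(println)  // a rolling window of recent batches

// Or register a listener for asynchronous callbacks on every batch.
spark.streams.addListener(new StreamingQueryListener {
  def onQueryStarted(event: QueryStartedEvent): Unit = {}
  def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"inputRowsPerSecond=${p.inputRowsPerSecond}, " +
      s"processedRowsPerSecond=${p.processedRowsPerSecond}")
  }
})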

On Wed, Jun 20, 2018 at 6:06 PM, Tathagata Das 
wrote:

> Structured Streaming does not maintain a queue of batches like DStreams did.
> DStreams used to cut off batches at a fixed interval and put them in a queue,
> and a different thread processed the queued batches. In contrast, Structured
> Streaming simply cuts off and immediately processes a batch after the
> previous batch finishes. So the question about queue size and lag does not
> apply to Structured Streaming.
>
> That said, there is no dedicated UI for Structured Streaming. You can see the
> SQL plans for each micro-batch in the SQL tab.
>
>
>
>
>
> On Wed, Jun 20, 2018 at 12:12 PM, SRK  wrote:
>
>> hi,
>>
>> How do we get information like lag and queued-up batches in Structured
>> Streaming? The following API does not seem to give any info about lag and
>> queued-up batches similar to what DStreams provide.
>>
>> https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/streaming/scheduler/BatchInfo.html
>>
>> Thanks!
>>
>>
>>
>>
>>
>>
>


Re: Lag and queued up batches info in Structured Streaming UI

2018-06-20 Thread Tathagata Das
Structured Streaming does not maintain a queue of batches like DStreams did.
DStreams used to cut off batches at a fixed interval and put them in a queue,
and a different thread processed the queued batches. In contrast, Structured
Streaming simply cuts off and immediately processes a batch after the
previous batch finishes. So the question about queue size and lag does not
apply to Structured Streaming.

That said, there is no dedicated UI for Structured Streaming. You can see the
SQL plans for each micro-batch in the SQL tab.





On Wed, Jun 20, 2018 at 12:12 PM, SRK  wrote:

> hi,
>
> How do we get information like lag and queued-up batches in Structured
> Streaming? The following API does not seem to give any info about lag and
> queued-up batches similar to what DStreams provide.
>
> https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/streaming/scheduler/BatchInfo.html
>
> Thanks!
>
>
>
>
>
>


Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread Daniel Pires
Thanks for coming back with the solution!

Sorry my suggestion did not help


Daniel

On Wed, 20 Jun 2018, 21:46 mattl156,  wrote:

> Alright so I figured it out.
>
> When reading from and writing to Hive metastore Parquet tables, Spark SQL
> will try to use its own Parquet support instead of Hive SerDe for better
> performance.
>
> So setting properties like the ones below has no impact:
>
> sqlContext.setConf("mapred.input.dir.recursive","true");
> sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
>
> The default Parquet support also does not read recursive directories.
>
> Whether Spark uses its own Parquet support or the Hive SerDe is controlled
> by the spark.sql.hive.convertMetastoreParquet configuration, which is on by
> default.
>
> After setting this to false with
> sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false"),
> I can set the other two properties above and query my Parquet table with
> subdirectories.
>
> -Matt
>
>
>
>
>
>
>


Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread mattl156
Alright so I figured it out.

When reading from and writing to Hive metastore Parquet tables, Spark SQL
will try to use its own Parquet support instead of Hive SerDe for better
performance. 

So setting properties like the ones below has no impact:

sqlContext.setConf("mapred.input.dir.recursive","true");
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

The default Parquet support also does not read recursive directories.

Whether Spark uses its own Parquet support or the Hive SerDe is controlled
by the spark.sql.hive.convertMetastoreParquet configuration, which is on by
default.

After setting this to false with
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false"),
I can set the other two properties above and query my Parquet table with
subdirectories.
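
For reference, a minimal sketch of those settings combined (Spark 2.x with
Hive support enabled; db.table1 stands in for the actual table name):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-subdir-table")
  .enableHiveSupport()
  .getOrCreate()

// Fall back to the Hive SerDe instead of Spark's built-in Parquet reader,
// then let the input format recurse into subdirectories.
spark.sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
spark.sqlContext.setConf("mapred.input.dir.recursive", "true")
spark.sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

spark.sql("SELECT * FROM db.table1").show()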

-Matt








Lag and queued up batches info in Structured Streaming UI

2018-06-20 Thread SRK
hi,

How do we get information like lag and queued-up batches in Structured
Streaming? The following API does not seem to give any info about lag and
queued-up batches similar to what DStreams provide.

https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/streaming/scheduler/BatchInfo.html

Thanks!







Re: Blockmgr directories intermittently not being cleaned up

2018-06-20 Thread tBoyle
I'm experiencing the same behaviour with shuffle data being orphaned on disk
(Spark 2.0.1 with Spark Streaming).

We are using AWS R4 EC2 instances with 300 GB EBS volumes attached; most
spilled shuffle data is eventually cleaned up by the ContextCleaner within
10 minutes. We do not use the external shuffle service, and we run on Mesos.

Occasionally some shuffle files are never removed until the application is
gracefully shut down or dies due to lack of disk space. I am confident the
orphaned shuffle data is not in use by any jobs after 5 minutes (the batch
duration). Do you know of any possible causes of this shuffle data not
being cleaned up and being left orphaned on disk?






Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Kazuaki Ishizaki
If it is difficult to create a small stand-alone program, another
approach is to attach everything (i.e. configuration, data, program,
console output, logs, history server data, etc.).
For the log, the community would recommend an INFO-level log with
"spark.sql.codegen.logging.maxLines=2147483647". The log has to include
all of the generated Java methods.

The community may take more time to address this problem than it would
with a small program.
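
For illustration, a minimal sketch of capturing that log (assuming an existing
SparkSession named spark; set these before running the failing query):

// Do not truncate the generated code in the logs.
spark.conf.set("spark.sql.codegen.logging.maxLines", "2147483647")
// Make sure INFO-level messages from the code generator are not filtered out.
spark.sparkContext.setLogLevel("INFO")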

Best Regards,
Kazuaki Ishizaki



From:   Aakash Basu 
To: Kazuaki Ishizaki 
Cc: vaquar khan , Eyal Zituny 
, user 
Date:   2018/06/21 01:29
Subject:Re: [Help] Codegen Stage grows beyond 64 KB



Hi Kazuaki,

It would be really difficult to produce a small stand-alone program to reproduce
this problem because I'm running a big feature-engineering pipeline
where I derive a lot of variables from the existing ones, which
explodes the size of the table many-fold. Then, when I do any kind of
join, this error shoots up.

I tried with wholeStage.codegen=false, but that errors out the entire
program rather than running it with less-optimized code.

Any suggestion on how I can proceed towards a JIRA entry for this?

Thanks,
Aakash.

On Wed, Jun 20, 2018 at 9:41 PM, Kazuaki Ishizaki  
wrote:
Spark 2.3 tries to split large generated Java methods into smaller methods
where possible. However, this report suggests there may remain places that
generate a large method.

Would it be possible to create a JIRA entry with a small stand-alone
program that can reproduce this problem? It would be very helpful for the
community in addressing this problem.

Best regards,
Kazuaki Ishizaki



From:vaquar khan 
To:Eyal Zituny 
Cc:Aakash Basu , user <
user@spark.apache.org>
Date:2018/06/18 01:57
Subject:Re: [Help] Codegen Stage grows beyond 64 KB




Totally agreed with Eyal.

The problem is that when the Java code that Catalyst generates from
DataFrame and Dataset programs is compiled into Java bytecode, the
bytecode of any single method must not reach 64 KB. This conflicts
with a limitation of the Java class file format, which is why the
exception occurs.

To avoid this exception, Spark's solution is to split generated methods
that are likely to exceed 64 KB of bytecode into multiple smaller methods
when Catalyst generates the Java code.

Use persist or some other logical separation in your pipeline.

Regards,
Vaquar khan 

On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny  
wrote:
Hi Akash,
Such errors might appear in large Spark pipelines; the root cause is a
64 KB JVM limitation.
The reason your job isn't failing in the end is Spark's fallback: if
codegen fails, the Spark compiler will try to build the flow
without codegen (less optimized).
If you do not want to see this error, you can either disable codegen
using the flag spark.sql.codegen.wholeStage="false",
or you can try to split your complex pipeline into several Spark flows if
possible.
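
For illustration, a minimal sketch of that flag (spark here is an existing
SparkSession; it can also be passed at submit time):

// Disable whole-stage code generation for this session.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// Equivalent at submit time:
//   spark-submit --conf spark.sql.codegen.wholeStage=false ...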

hope that helps

Eyal

On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu  
wrote:
Hi,

I already went through it; that's one use case. I have a complex and very
big pipeline of multiple jobs under one Spark session. I'm not getting how
to solve this, as it is happening with Logistic Regression and Random
Forest models, which I'm just using from the Spark ML package rather than
implementing anything myself.

Thanks,
Aakash.

On Sun 17 Jun, 2018, 8:21 AM vaquar khan,  wrote:
Hi Akash,

Please check stackoverflow.

https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe


Regards,
Vaquar khan

On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu  
wrote:
Hi guys,

I'm getting an error when I'm feature engineering on 30+ columns to create
about 200+ columns. It is not failing the job, but the ERROR shows. I want
to know how I can avoid this.

Spark - 2.3.1
Python - 3.6

Cluster Config -
1 Master - 32 GB RAM, 16 Cores
4 Slaves - 16 GB RAM, 8 Cores


Input data - 8 partitions of parquet file with snappy compression.

My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077
--num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 
5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf 
spark.driver.maxResultSize=2G --conf 
"spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf 
spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py 
> /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt


Stack-Trace below -

ERROR CodeGenerator:91 - failed to compile:
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass":
Code of method "processNext()V" of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426"
grows beyond 64 KB

Re: spark kudu issues

2018-06-20 Thread William Berkeley
What you're seeing is the expected behavior in both cases.

One way to achieve the semantics you want in both situations is to read in
the Kudu table to a data frame, then filter it in Spark SQL to contain just
the rows you want to delete, and then use that dataframe to do the
deletion. There should be no primary key errors with that method unless the
table is concurrently being deleted from.
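
As a rough sketch of that read-filter-delete flow with the kudu-spark
integration (Kudu 1.x-era API; the master address, table name, predicate, and
key column below are placeholders, and spark is an existing SparkSession):

import org.apache.kudu.spark.kudu._
import org.apache.spark.sql.functions.col

val kuduMaster = "kudu-master:7051"
val tableName  = "impala::default.my_table"

val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

// Read the table, keep only rows that actually exist and match the predicate,
// and project down to the primary key column(s) before deleting.
val toDelete = spark.read
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
  .format("org.apache.kudu.spark.kudu")
  .load()
  .filter(col("some_column") === "some_value")
  .select("key_col")

kuduContext.deleteRows(toDelete, tableName)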

For 2), what I describe above is what Impala does: it reads in the Kudu
table, finds the full primary keys of rows matching your partial
specification of the key, and then issues deletes for those rows.

Note that deletes of multiple rows aren't transactional.

I think having a way to issue deletes that ignores primary key errors is
reasonable, a "delete ignore" analog to insert ignore. I filed KUDU-2482 for it.

-Will


On Wed, Jun 20, 2018 at 8:31 AM, Pietro Gentile <
pietro.gentile89.develo...@gmail.com> wrote:

> Hi all,
>
> I am currently evaluating using Spark with Kudu.
> So I am facing the following issues:
>
> 1) If you try to DELETE a row with a key that is not present on the table
> you will have an Exception like this:
>
> java.lang.RuntimeException: failed to write N rows from DataFrame to Kudu;
> sample errors: Not found: key not found (error 0)
>
> 2) If you try to DELETE a row using a subset of a table key you will face
> the following:
>
> Caused by: java.lang.RuntimeException: failed to write N rows from
> DataFrame to Kudu; sample errors: Invalid argument: No value provided for
> key column:
>
> The use cases presented above are correctly working if you interact with
> kudu using Impala.
>
> Any suggestions to overcome these limitations?
>
> Thanks.
> Best Regards
>
> Pietro
>


Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread mattl156
Thanks. Unfortunately I don't have control over how the data is inserted, and
the table is not partitioned.

The reason the subdirectories are being created is that when Tez does an
INSERT into a table from a UNION query, it creates subdirectories so that it
can write in parallel.

I've also realized that it's only with Parquet; CSV works fine.






Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Aakash Basu
Hi Kazuaki,

It would be really difficult to produce a small stand-alone program to reproduce
this problem because I'm running a big feature-engineering pipeline
where I derive a lot of variables from the existing ones, which
explodes the size of the table many-fold. Then, when I do any kind of
join, this error shoots up.

I tried with wholeStage.codegen=false, but that errors out the entire
program rather than running it with less-optimized code.

Any suggestion on how I can proceed towards a JIRA entry for this?

Thanks,
Aakash.

On Wed, Jun 20, 2018 at 9:41 PM, Kazuaki Ishizaki 
wrote:

> Spark 2.3 tries to split large generated Java methods into smaller methods
> where possible. However, this report suggests there may remain places that
> generate a large method.
>
> Would it be possible to create a JIRA entry with a small stand-alone
> program that can reproduce this problem? It would be very helpful for the
> community in addressing this problem.
>
> Best regards,
> Kazuaki Ishizaki
>
>
>
> From:vaquar khan 
> To:Eyal Zituny 
> Cc:Aakash Basu , user <
> user@spark.apache.org>
> Date:2018/06/18 01:57
> Subject:Re: [Help] Codegen Stage grows beyond 64 KB
> --
>
>
>
> Totally agreed with Eyal.
>
> The problem is that when the Java code that Catalyst generates from
> DataFrame and Dataset programs is compiled into Java bytecode, the
> bytecode of any single method must not reach 64 KB. This conflicts
> with a limitation of the Java class file format, which is why the
> exception occurs.
>
> To avoid this exception, Spark's solution is to split generated methods
> that are likely to exceed 64 KB of bytecode into multiple smaller methods
> when Catalyst generates the Java code.
>
> Use persist or some other logical separation in your pipeline.
>
> Regards,
> Vaquar khan
>
> On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny <*eyal.zit...@equalum.io*
> > wrote:
> Hi Akash,
> Such errors might appear in large Spark pipelines; the root cause is a
> 64 KB JVM limitation.
> The reason your job isn't failing in the end is Spark's fallback: if
> codegen fails, the Spark compiler will try to build the flow
> without codegen (less optimized).
> If you do not want to see this error, you can either disable codegen
> using the flag spark.sql.codegen.wholeStage="false",
> or you can try to split your complex pipeline into several Spark flows if
> possible.
>
> hope that helps
>
> Eyal
>
> On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu <*aakash.spark@gmail.com*
> > wrote:
> Hi,
>
> I already went through it; that's one use case. I have a complex and very
> big pipeline of multiple jobs under one Spark session. I'm not getting how
> to solve this, as it is happening with Logistic Regression and Random
> Forest models, which I'm just using from the Spark ML package rather than
> implementing anything myself.
>
> Thanks,
> Aakash.
>
> On Sun 17 Jun, 2018, 8:21 AM vaquar khan, <*vaquar.k...@gmail.com*
> > wrote:
> Hi Akash,
>
> Please check stackoverflow.
>
>
> *https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe*
> 
>
> Regards,
> Vaquar khan
>
> On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu <*aakash.spark@gmail.com*
> > wrote:
> Hi guys,
>
> I'm getting an error when I'm feature engineering on 30+ columns to create
> about 200+ columns. It is not failing the job, but the ERROR shows. I want
> to know how I can avoid this.
>
> Spark - 2.3.1
> Python - 3.6
>
> Cluster Config -
> 1 Master - 32 GB RAM, 16 Cores
> 4 Slaves - 16 GB RAM, 8 Cores
>
>
> Input data - 8 partitions of parquet file with snappy compression.
>
> My Spark-Submit -> spark-submit --master spark://*192.168.60.20:7077*
> --num-executors 4 --executor-cores 5
> --executor-memory 10G --driver-cores 5 --driver-memory 25G --conf
> spark.sql.shuffle.partitions=60 --conf spark.driver.maxResultSize=2G
> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf
> spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py
> > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt
>
>
> Stack-Trace below -
>
> ERROR CodeGenerator:91 - failed to compile:
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass":
> Code of method "processNext()V" of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426"
> grows beyond 64 KB
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass":
> Code of method "processNext()V" of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$
> 

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Kazuaki Ishizaki
Spark 2.3 tries to split large generated Java methods into smaller methods
where possible. However, this report suggests there may remain places that
generate a large method.

Would it be possible to create a JIRA entry with a small stand-alone
program that can reproduce this problem? It would be very helpful for the
community in addressing this problem.

Best regards,
Kazuaki Ishizaki



From:   vaquar khan 
To: Eyal Zituny 
Cc: Aakash Basu , user 

Date:   2018/06/18 01:57
Subject:Re: [Help] Codegen Stage grows beyond 64 KB



Totally agreed with Eyal.

The problem is that when the Java code that Catalyst generates from
DataFrame and Dataset programs is compiled into Java bytecode, the
bytecode of any single method must not reach 64 KB. This conflicts
with a limitation of the Java class file format, which is why the
exception occurs.

To avoid this exception, Spark's solution is to split generated methods
that are likely to exceed 64 KB of bytecode into multiple smaller methods
when Catalyst generates the Java code.

Use persist or some other logical separation in your pipeline.

Regards,
Vaquar khan 

On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny  
wrote:
Hi Akash,
Such errors might appear in large Spark pipelines; the root cause is a
64 KB JVM limitation.
The reason your job isn't failing in the end is Spark's fallback: if
codegen fails, the Spark compiler will try to build the flow
without codegen (less optimized).
If you do not want to see this error, you can either disable codegen
using the flag spark.sql.codegen.wholeStage="false",
or you can try to split your complex pipeline into several Spark flows if
possible.

hope that helps

Eyal

On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu  
wrote:
Hi,

I already went through it; that's one use case. I have a complex and very
big pipeline of multiple jobs under one Spark session. I'm not getting how
to solve this, as it is happening with Logistic Regression and Random
Forest models, which I'm just using from the Spark ML package rather than
implementing anything myself.

Thanks,
Aakash.

On Sun 17 Jun, 2018, 8:21 AM vaquar khan,  wrote:
Hi Akash,

Please check stackoverflow.

https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe

Regards,
Vaquar khan

On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu  
wrote:
Hi guys,

I'm getting an error when I'm feature engineering on 30+ columns to create
about 200+ columns. It is not failing the job, but the ERROR shows. I want
to know how I can avoid this.

Spark - 2.3.1
Python - 3.6

Cluster Config -
1 Master - 32 GB RAM, 16 Cores
4 Slaves - 16 GB RAM, 8 Cores


Input data - 8 partitions of parquet file with snappy compression.

My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077 
--num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 
5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf 
spark.driver.maxResultSize=2G --conf 
"spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf 
spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py 
> /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt

Stack-Trace below -

ERROR CodeGenerator:91 - failed to compile:
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass":
Code of method "processNext()V" of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426"
grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass":
Code of method "processNext()V" of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426"
grows beyond 64 KB
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
at 
org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 

spark kudu issues

2018-06-20 Thread Pietro Gentile
Hi all,

I am currently evaluating using Spark with Kudu.
So I am facing the following issues:

1) If you try to DELETE a row with a key that is not present on the table
you will have an Exception like this:

java.lang.RuntimeException: failed to write N rows from DataFrame to Kudu;
sample errors: Not found: key not found (error 0)

2) If you try to DELETE a row using a subset of a table key you will face
the following:

Caused by: java.lang.RuntimeException: failed to write N rows from
DataFrame to Kudu; sample errors: Invalid argument: No value provided for
key column:

The use cases presented above are correctly working if you interact with
kudu using Impala.

Any suggestions to overcome these limitations?

Thanks.
Best Regards

Pietro


Re: G1GC vs ParallelGC

2018-06-20 Thread vaquar khan
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

Regards,
Vaquar khan

On Wed, Jun 20, 2018, 1:18 AM Aakash Basu 
wrote:

> Hi guys,
>
> I just wanted to know why my ParallelGC setting (--conf
> "spark.executor.extraJavaOptions=-XX:+UseParallelGC") in a very long
> Spark ML pipeline runs faster than when I set G1GC (--conf
> "spark.executor.extraJavaOptions=-XX:+UseG1GC"), even though the Spark
> community suggests G1GC is much better than ParallelGC.
>
> Any pointers?
>
> Thanks,
> Aakash.
>


Re: load hbase data using spark

2018-06-20 Thread vaquar khan
Why do you need a tool? You can connect to HBase directly from Spark.
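
For illustration, a rough sketch of reading an HBase table directly with the
standard TableInputFormat (assuming the hbase-client/hbase-mapreduce jars are
on the classpath; the ZooKeeper quorum, table name, and the SparkSession named
spark are placeholders/assumptions):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")
conf.set(TableInputFormat.INPUT_TABLE, "my_table")

// Each record is (row key, full Result) for one HBase row.
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Print a few row keys just to show the data came through.
hbaseRDD.map { case (key, _) => Bytes.toString(key.copyBytes()) }
  .take(10).foreach(println)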

Regards,
Vaquar khan

On Jun 18, 2018 4:37 PM, "Lian Jiang"  wrote:

Hi,

I am considering tools to load hbase data using spark. One choice is
https://github.com/Huawei-Spark/Spark-SQL-on-HBase. However, this seems to
be out-of-date (e.g. "This version of 1.0.0 requires Spark 1.4.0."). Which
tool should I use for this purpose? Thanks for any hint.


Re: Spark application completes its job successfully on Yarn cluster but Yarn registers it as failed

2018-06-20 Thread Sonal Goyal
Have you checked the logs? They should have some more details.

On Wed 20 Jun, 2018, 2:51 PM Soheil Pourbafrani, 
wrote:

> Hi,
>
> I run a Spark application on a Yarn cluster and it completes the process
> successfully, but at the end Yarn prints this in the console:
>
> client token: N/A
> diagnostics: Application application_1529485137783_0004 failed 4 times due
> to AM Container for appattempt_1529485137783_0004_04 exited with
> exitCode: 1
> Failing this attempt.Diagnostics: Exception from container-launch.
> Container id: container_e447_1529485137783_0004_04_01
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
> at org.apache.hadoop.util.Shell.run(Shell.java:869)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
>
> Container exited with a non-zero exit code 1
> For more detailed output, check the application tracking page:
> http://snamenode:8088/cluster/app/application_1529485137783_0004 Then
> click on links to logs of each attempt.
> . Failing the application.
> ApplicationMaster host: N/A
> ApplicationMaster RPC port: -1
> queue: streamQ
> start time: 1529485513957
> final status: FAILED
> tracking URL:
> http://snamenode:8088/cluster/app/application_1529485137783_0004
> user: manager
> Exception in thread "main" org.apache.spark.SparkException: Application
> application_1529485137783_0004 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> and in the Yarn UI the app status is registered as failed. What's the problem?
>


Apache Spark use case: correlate data strings from file

2018-06-20 Thread darkdrake
Hi, I'm new to Spark and I'm trying to understand if it can fit my use case.

I have the following scenario.
I have a file (it can be a log file, .txt, .csv, .xml or .json; I can
produce the data in whatever format I prefer) with some data, e.g.:
*Event “X”, City “Y”, Zone “Z”*

with different events, cities and zones. This data can be represented as
strings (like the one I wrote) in a .txt, or as XML, CSV, or JSON, as I
wish. I can also send this data through a TCP socket if I need to.

What I really want to do is to *correlate each single entry with other
similar entries by declaring rules*.
For example, I want to declare some rules on the data flow: if I receive
event X1 and event X2 in the same city and the same zone, I want to do
something (execute a .bat script, write a log file, etc.). The same applies
if I receive the same string multiple times, or whatever rule I want to
define over these data strings.
I'm trying to understand if Apache Spark can fit my use case. The only input
data will be these strings from this file. Can I correlate these events, and
how? Is there a GUI to do it?
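
For what it's worth, a minimal batch sketch of one such rule ("X1 and X2 seen
in the same city and zone"); the file path and column names are assumptions,
not taken from this message:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("event-rules").getOrCreate()
import spark.implicits._

// Assumed columns: event, city, zone
val events = spark.read.option("header", "true").csv("events.csv")

val matches = events
  .groupBy($"city", $"zone")
  .agg(collect_set($"event").as("events"))
  .filter(array_contains($"events", "X1") && array_contains($"events", "X2"))

// Each remaining row is a (city, zone) where the rule fired; react as needed
// (write a log, trigger a script, etc.). Similar rules can be expressed with
// Structured Streaming for the TCP-socket case.
matches.show()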

Any hints and advice will be appreciated.
Best regards,
Simone







Spark application completes its job successfully on Yarn cluster but Yarn registers it as failed

2018-06-20 Thread Soheil Pourbafrani
Hi,

I run a Spark application on a Yarn cluster and it completes the process
successfully, but at the end Yarn prints this in the console:

client token: N/A
diagnostics: Application application_1529485137783_0004 failed 4 times due
to AM Container for appattempt_1529485137783_0004_04 exited with
exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_e447_1529485137783_0004_04_01
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 1
For more detailed output, check the application tracking page:
http://snamenode:8088/cluster/app/application_1529485137783_0004 Then click
on links to logs of each attempt.
. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: streamQ
start time: 1529485513957
final status: FAILED
tracking URL:
http://snamenode:8088/cluster/app/application_1529485137783_0004
user: manager
Exception in thread "main" org.apache.spark.SparkException: Application
application_1529485137783_0004 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


and in the Yarn UI the app status is registered as failed. What's the problem?


Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread Daniel Pires
Hi Matt,

What I tend to do is partition by date in the following way:

s3://data-lake/pipeline1/extract_year=2018/extract_month=06/extract_day=20/file1.json


Note that the pattern is key=value for physical partitions.

When you read that like this:
spark.read.json("s3://data-lake/pipeline1/")

it will bring you the data with the schema inferred from the JSON plus 3
fields: extract_year, extract_month, extract_day.

I was looking for the documentation that describes this way of partitioning
but could not find it; I can reply with the link once I do.
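
For completeness, a minimal sketch of producing that layout with partitionBy
(df and spark are assumed to exist; path and column names as above):

// Write with partitionBy so the key=value directories are created automatically.
df.write
  .partitionBy("extract_year", "extract_month", "extract_day")
  .json("s3://data-lake/pipeline1/")

// Reading the root path back infers both the JSON schema and the partition columns.
val readBack = spark.read.json("s3://data-lake/pipeline1/")
readBack.printSchema()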

Hope that helps,
---
Daniel Mateus Pires
Data Engineer
Hudson's Bay Company

On Wed, Jun 20, 2018 at 5:39 AM, mattl156  wrote:

> Hello,
>
>
>
> We have a number of Hive tables (non partitioned) that are populated with
> subdirectories. (result of tez execution engine union queries)
>
>
>
> E.g. Table location: “s3://table1/” With the actual data residing in:
>
>
>
> s3://table1/1/data1
>
> s3://table1/2/data2
>
> s3://table1/3/data3
>
>
>
> When using SparkSession (sql/hiveContext has the same behavior) and
> spark.sql to query the data, no records are displayed due to these
> subdirectories.
>
>
>
> e.g
>
> val df = spark.sql("select * from db.table1").show()
>
>
>
> I’ve tried a number of setConf properties e.g.
> spark.hive.mapred.supports.subdirectories=true,
> mapreduce.input.fileinputformat.input.dir.recursive=true but it does not
> look like any of these properties are supported.
>
>
>
> Has anyone run into similar problems or ways to resolve it? Our current
> alternatives are reading the input path directory directly e.g.:
>
>
>
>
> spark.read.csv("s3://bucket-name/table1/bullseye_segments/*/*")
>
>
> But this requires prior knowledge of the path or an extra step to determine
> it.
>
>
> Thanks,
>
> Matt
>
>
>
>
>
>
>


[Spark Streaming] Are SparkListener/StreamingListener callbacks called concurrently?

2018-06-20 Thread Majid Azimi
Hi,

What is the concurrency model behind SparkListener or StreamingListener 
callbacks?
1. Multiple threads might access callbacks simultaneously.
2. Callbacks are guaranteed to be executed by a single thread. (Thread IDs might
change on consecutive calls, though.)

I asked the same question on stackoverflow, and waited for about one day. Since 
there was no response, I'm reposting it here.

https://stackoverflow.com/questions/50921585/are-sparklistener-streaminglistener-callbacks-called-concurrently

Way to avoid CollectAsMap in RandomForest

2018-06-20 Thread Aakash Basu
Hi,

I'm running the RandomForest model from the Spark ML API on medium-sized data
(2.25 million rows and 60 features). Most of my time goes into the
collectAsMap inside RandomForest, but I have no option to avoid it as it is in
the API.

Is there a way to cut short my end-to-end runtime?



Thanks,
Aakash.


G1GC vs ParallelGC

2018-06-20 Thread Aakash Basu
Hi guys,

I just wanted to know why my ParallelGC setting (--conf
"spark.executor.extraJavaOptions=-XX:+UseParallelGC") in a very long Spark
ML pipeline runs faster than when I set G1GC (--conf
"spark.executor.extraJavaOptions=-XX:+UseG1GC"), even though the Spark
community suggests G1GC is much better than ParallelGC.

Any pointers?

Thanks,
Aakash.