[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: HBASE-14801-3.patch > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch, > HBASE-14801-3.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Status: Patch Available (was: In Progress) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Status: In Progress (was: Patch Available) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: HBASE-14801-2.patch > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: (was: HBASE-14801-2.patch) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: ORC file writing hangs in pyspark
Hi James, You can try writing with another format, e.g., Parquet, to see whether it is an ORC-specific issue or a more generic one. Thanks. Zhan Zhang On Feb 23, 2016, at 6:05 AM, James Barney <jamesbarne...@gmail.com> wrote: I'm trying to write an ORC file after running the FPGrowth algorithm on a dataset of around just 2GB in size. The algorithm performs well and can display results if I take(n) the freqItemSets() of the result after converting that to a DF. I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on Yarn. I get the results from querying a Hive table, also ORC format, running a number of maps, joins, and filters on the data. When the program attempts to write the files: result.write.orc('/data/staged/raw_result') size_1_buckets.write.orc('/data/staged/size_1_results') filter_size_2_buckets.write.orc('/data/staged/size_2_results') The first path, /data/staged/raw_result, is created with a _temporary folder, but the data is never written. The job hangs at this point, apparently indefinitely. Additionally, no logs are recorded or available for the jobs on the history server. What could be the problem?
[jira] [Commented] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159806#comment-15159806 ] Zhan Zhang commented on HBASE-14801: Will update the scoreboard after the sanity test by server. > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement >Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: HBASE-14801-2.patch > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems
[ https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125205#comment-15125205 ] Zhan Zhang commented on SPARK-7009: --- Yes. This one is obsoleted. > Build assembly JAR via ant to avoid zip64 problems > -- > > Key: SPARK-7009 > URL: https://issues.apache.org/jira/browse/SPARK-7009 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.0 > Environment: Java 7+ >Reporter: Steve Loughran > Attachments: check_spark_python.sh > > Original Estimate: 2h > Remaining Estimate: 2h > > SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a > format incompatible with Java and pyspark. > Provided the total number of .class files+resources is <64K, ant can be used > to make the final JAR instead, perhaps by unzipping the maven-generated JAR > then rezipping it with zip64=never, before publishing the artifact via maven. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118266#comment-15118266 ] Zhan Zhang commented on HBASE-14801: Looks like most of the warnings do not apply to this patch. I will update the patch after collecting more feedback. > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement >Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-11075) Spark SQL Thrift Server authentication issue on kerberized yarn cluster
[ https://issues.apache.org/jira/browse/SPARK-11075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113560#comment-15113560 ] Zhan Zhang commented on SPARK-11075: Duplicated to SPARK-5159? > Spark SQL Thrift Server authentication issue on kerberized yarn cluster > > > Key: SPARK-11075 > URL: https://issues.apache.org/jira/browse/SPARK-11075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0, 1.5.1 > Environment: hive-1.2.1 > hadoop-2.6.0 config kerbers >Reporter: Xiaoyu Wang > > Use proxy user connect to the thrift server by beeline but got permission > exception: > 1.Start the hive 1.2.1 metastore with user hive > {code} > $kinit -kt /tmp/hive.keytab hive/xxx > $nohup ./hive --service metastore 2>&1 >> ../logs/metastore.log & > {code} > 2.Start the spark thrift server with user hive > {code} > $kinit -kt /tmp/hive.keytab hive/xxx > $./start-thriftserver.sh --master yarn > {code} > 3.Connect to the thrift server with proxy user hive01 > {code} > $kinit hive01 > beeline command:!connect > jdbc:hive2://xxx:1/default;principal=hive/x...@hadoop.com;kerberosAuthType=kerberos;hive.server2.proxy.user=hive01 > {code} > 4.Create table and insert data > {code} > create table test(name string); > insert overwrite table test select * from sometable; > {code} > the insert sql got exception: > {noformat} > Error: org.apache.hadoop.security.AccessControlException: Permission denied: > user=hive01, access=WRITE, > inode="/user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1/_temporary/0/task_201510100917_0003_m_00":hive:hadoop:drwxr-xr-x > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238) > at > 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:182) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6512) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInternal(FSNamesystem.java:3805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInt(FSNamesystem.java:3775) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3739) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rename(NameNodeRpcServer.java:754) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.rename(ClientNamenodeProtocolServerSideTranslatorPB.java:565) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > (state=,code=0) > {noformat} > The table path on HDFS: > {noformat} > drwxrwxrwx - hive hadoop 0 2015-10-10 09:14 > /user/hive/warehouse/test > drwxrwxrwx - hive01 hadoop 0 2015-10-10 09:17 > /user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2 > drwxr-xr-x - hive01 hadoop 0 2015-10-10 09:17 > /user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1 > drwxr-xr-x - hive01 hadoop 0 2015-10-10 09:17 > 
/user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1/_temporary > drwxr-xr-x - hive01 hadoop 0 2015-10-10 09:17 > /user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1/_t
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113469#comment-15113469 ] Zhan Zhang commented on SPARK-5159: --- [~luciano resende] Given the current code base, I don't think impersonation works, unless I am missing something. In your case, you may want to verify who is accessing HDFS: is it the driver or an executor? You may retry the case the other way around (with the right permissions) to see whether the executor can access the file correctly. Currently, the driver does support impersonation if configured, but executors do not. > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Status: Patch Available (was: Open) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: HBASE-14801-1.patch > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102183#comment-15102183 ] Zhan Zhang edited comment on SPARK-5159 at 1/15/16 5:50 PM: What happen if an user have a valid visit to a table, which will be saved in catalog. Another user then also can visit the table as it is cached in local hivecatalog, even if the latter does not have the access to the table meta data, right? To make the impersonate to work, all the information has to be tagged by user, right? was (Author: zzhan): What happen if an user have a valid visit to a table, which will be saved in catalog. Another user then also can visit the table as it is cached in local hivecatalog, even if the latter does not have the access to the table, right? To make the impersonate to really work, all the information has to be tagged by user, right? > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102183#comment-15102183 ] Zhan Zhang commented on SPARK-5159: --- What happens if a user has valid access to a table, which is then saved in the catalog? Another user can also access the table, since it is cached in the local Hive catalog, even if the latter does not have access to the table, right? To make impersonation really work, all of this information has to be tagged by user, right? > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098734#comment-15098734 ] Zhan Zhang commented on SPARK-5159: --- This feature is definitely broken. But fixing it needs a complete design review first. For example, to enable impersonation (doAs) at runtime, how do we solve RDD sharing between different users? We can propagate the user to the executor piggybacked on the TaskDescription. But what happens if two users operate on two RDDs that share the same parent, with the cache created by another user? Currently, RDD scope is the SparkContext, without any user information. That means even if we do impersonation, it is meaningless, per my understanding. > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Review Request 42118: AMBARI-14601 Disable impersonation in spark hive support
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/42118/ --- Review request for Ambari and Robert Levas. Bugs: AMBARI-14601 https://issues.apache.org/jira/browse/AMBARI-14601 Repository: ambari Description --- Currently the Spark thrift server cannot do impersonation correctly. We have to disable this feature. Diffs - ambari-server/src/main/resources/stacks/HDP/2.3/services/SPARK/configuration/spark-hive-site-override.xml 8f0bc62 Diff: https://reviews.apache.org/r/42118/diff/ Testing --- Manual testing is done, and it works as expected. Without the patch, we hit the file permission issue; disabling impersonation fixes the issue, as below: at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) Thanks, Zhan Zhang
[jira] [Updated] (AMBARI-14601) Disable impersonation in spark
[ https://issues.apache.org/jira/browse/AMBARI-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated AMBARI-14601: Attachment: AMBARI-14601.patch set hive.server2.enable.doAs to false > Disable impersonation in spark > -- > > Key: AMBARI-14601 > URL: https://issues.apache.org/jira/browse/AMBARI-14601 > Project: Ambari > Issue Type: Bug > Reporter: Zhan Zhang > Attachments: AMBARI-14601.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AMBARI-14601) Disable impersonation in spark
[ https://issues.apache.org/jira/browse/AMBARI-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091161#comment-15091161 ] Zhan Zhang commented on AMBARI-14601: - Currently the Spark thrift server cannot do impersonation correctly. We have to disable this feature. > Disable impersonation in spark > -- > > Key: AMBARI-14601 > URL: https://issues.apache.org/jira/browse/AMBARI-14601 > Project: Ambari > Issue Type: Bug >Reporter: Zhan Zhang > Attachments: AMBARI-14601.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AMBARI-14601) Disable impersonation in spark
Zhan Zhang created AMBARI-14601: --- Summary: Disable impersonation in spark Key: AMBARI-14601 URL: https://issues.apache.org/jira/browse/AMBARI-14601 Project: Ambari Issue Type: Bug Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086414#comment-15086414 ] Zhan Zhang commented on HBASE-14801: I will start working on this. Please let me know if anyone has any concerns or comments. > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement >Reporter: Zhan Zhang > Assignee: Zhan Zhang > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14796: --- Attachment: HBASE-14796-1.patch Addresses review comments > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14796-1.patch, HBASE-14976.patch > > > Currently the Spark-Module Spark SQL implementation gets records from HBase > from the driver if there is something like the following found in the SQL. > rowkey = 123 > The reason for this originally was that normal sql will not have many equal > operations in a single where clause. > Zhan had brought up two points that have value. > 1. The SQL may be generated and may have many many equal statements in it so > moving the work to an executor protects the driver from load > 2. In the current implementation the driver is connecting to HBase and > exceptions may cause trouble with the Spark application and not just with > a single task execution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Problem using limit clause in spark sql
There has to be a central point collecting exactly 10,000 records; the current approach uses a single partition, which is easy to implement. Otherwise, the driver would have to count the number of records in each partition and then decide how many records to materialize from each partition, because some partitions may not have enough records and some may even be empty. I don't see any straightforward workaround for this. Thanks. Zhan Zhang On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote: It is an application running as an http server. So I collect the data as the response. On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote: When you call collect() it will bring all the data to the driver. Do you mean to call persist() instead? From: tiandiwo...@icloud.com Subject: Problem using limit clause in spark sql Date: Wed, 23 Dec 2015 21:26:51 +0800 To: user@spark.apache.org Hi, I am using spark sql in a way like this: sqlContext.sql("select * from table limit 10000").map(...).collect() The problem is that the limit clause will collect all the 10,000 records into a single partition, resulting in the map afterwards running in only one partition and being really slow. I tried to use repartition, but it is kind of a waste to collect all those records into one partition, then shuffle them around, and then collect them again. Is there a way to work around this? BTW, there is no order by clause and I do not care which 10,000 records I get as long as the total number is less than or equal to 10,000.
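The coordination the reply describes can be sketched in plain Python, with lists standing in for partitions (no cluster required; the data and the `plan_limit` helper are made up for illustration):

```python
# Simulate "limit N" across partitions without funneling everything through
# one partition: the driver would first need per-partition record counts,
# then assign each partition a quota. Some partitions may be small or empty,
# which is exactly why a single-partition collect is the easy implementation.

def plan_limit(partition_sizes, n):
    """Decide how many records to take from each partition for `limit n`."""
    quotas = []
    remaining = n
    for size in partition_sizes:
        take = min(size, remaining)
        quotas.append(take)
        remaining -= take
    return quotas

partitions = [[1, 2, 3], [], [4, 5], [6, 7, 8, 9]]
quotas = plan_limit([len(p) for p in partitions], 5)
result = [x for part, q in zip(partitions, quotas) for x in part[:q]]
print(quotas)   # [3, 0, 2, 0]
print(result)   # [1, 2, 3, 4, 5]
```

The extra round trip to gather the counts is the cost the reply alludes to; the single-partition approach trades parallelism for avoiding it.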
[jira] [Commented] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070286#comment-15070286 ] Zhan Zhang commented on HBASE-14796: Thanks [~ted.m] for the quick review. It is reasonable to have a performance test, and I will try to grab a physical cluster for it. It may take some time, as I don't have a physical cluster for this. On the other hand, I do think we should change it to perform BulkGet in the executors regardless of the performance (although I think it should improve the performance rather than the other way around), because: 1. The current implementation does gather-scatter in the driver, which increases network overhead and latency if the number of gets is big. 2. Failure recovery. It is hard to do failure recovery when the work is performed in the driver, which is a single point of failure. The above two have been discussed in detail. But I just realized there is another potential issue: the current implementation may go against the Spark SQL engine design, as below. 3. Currently, the bulkGet happens in the query plan (buildScan), and the results stay in the driver (1st). The result is distributed to executors in query execution (2nd). 3.1 1st and 2nd do not always happen in pairs. Even worse, sometimes only 1st happens; for example, a user calls plan.explain but may never trigger the plan execution. 3.2 Memory taken by table.get may never get released in the driver, increasing the driver memory overhead. [~ted.m] Please let me know what you think, and correct me if my understanding is wrong. > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska >Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14976.patch > > > Currently the Spark-Module Spark SQL implementation gets records from HBase > from the driver if there is something like the following found in the SQL. > rowkey = 123 > The reason for this originally was that normal sql will not have many equal > operations in a single where clause. > Zhan had brought up two points that have value. > 1. The SQL may be generated and may have many many equal statements in it so > moving the work to an executor protects the driver from load > 2. In the current implementation the driver is connecting to HBase and > exceptions may cause trouble with the Spark application and not just with > a single task execution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
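The grouping idea behind executor-side BulkGet can be illustrated in plain Python (this is not the connector's actual code; the batch-size name just mirrors the `spark.hbase.bulkGetSize` setting mentioned in the release note, default 1000):

```python
# Group row keys into batches so each executor issues one multi-get RPC per
# batch instead of one RPC per key. Illustration only, not the connector's
# implementation; bulk_get_size mirrors spark.hbase.bulkGetSize (default 1000).

def group_gets(row_keys, bulk_get_size=1000):
    """Split row keys into batches of at most bulk_get_size."""
    return [row_keys[i:i + bulk_get_size]
            for i in range(0, len(row_keys), bulk_get_size)]

batches = group_gets(list(range(2500)), bulk_get_size=1000)
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Because each batch is formed and issued inside an executor task, a failed batch can be retried by re-running that task, which is the failure-recovery advantage the comment argues for.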
Re: Unable to create hive table using HiveContext
You are using embedded mode, which will create the db locally (in your case, the db may already have been created, but you do not have the right permissions?). To connect to a remote metastore, hive-site.xml has to be correctly configured. Thanks. Zhan Zhang On Dec 23, 2015, at 7:24 AM, Soni spark <soni2015.sp...@gmail.com> wrote: Hi friends, I am trying to create a hive table through spark with Java code in Eclipse using the code below. HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc()); sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)"); but i am getting error ERROR XBM0J: Directory /home/workspace4/Test/metastore_db already exists. I am not sure why the metastore is being created in the workspace. Please help me. Thanks Soniya
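A minimal hive-site.xml sketch for the remote-metastore configuration the reply refers to; the host and port are placeholders, not values from the thread:

```xml
<!-- Minimal hive-site.xml sketch: point the HiveContext at a remote
     metastore service instead of the embedded local Derby database.
     Host and port below are placeholders. -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>
```

With `hive.metastore.uris` set, Hive clients connect over Thrift instead of creating a `metastore_db` directory in the working directory.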
Re: DataFrameWriter.format(String) is there a list of options?
Now JSON, Parquet, ORC (in HiveContext), and text are natively supported. If you use Avro or other formats, you have to include the corresponding package, which is not built into the Spark jar. Thanks. Zhan Zhang On Dec 23, 2015, at 8:57 AM, Christopher Brady <christopher.br...@oracle.com> wrote: DataFrameWriter.format
[jira] [Updated] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14796: --- Release Note: spark.hbase.bulkGetSize in HBaseSparkConf controls the grouping of bulkGet operations; the default value is 1000. Status: Patch Available (was: Open) > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14976.patch > > > Currently the Spark-Module Spark SQL implementation gets records from HBase > from the driver if there is something like the following found in the SQL. > rowkey = 123 > The reason for this originally was that normal sql will not have many equal > operations in a single where clause. > Zhan had brought up two points that have value. > 1. The SQL may be generated and may have many many equal statements in it so > moving the work to an executor protects the driver from load > 2. In the current implementation the driver is connecting to HBase and > exceptions may cause trouble with the Spark application and not just with > a single task execution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14796: --- Attachment: HBASE-14976.patch We have use cases where a bulkGet may consist of thousands of gets. Moving BulkGet from the driver to the executor side will improve failure recovery, and potentially improve performance as well when the number of gets is big. > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14976.patch > > > Currently the Spark-Module Spark SQL implementation gets records from HBase > from the driver if there is something like the following found in the SQL. > rowkey = 123 > The reason for this originally was that normal sql will not have many equal > operations in a single where clause. > Zhan had brought up two points that have value. > 1. The SQL may be generated and may have many many equal statements in it so > moving the work to an executor protects the driver from load > 2. In the current implementation the driver is connecting to HBase and > exceptions may cause trouble with the Spark application and not just with > a single task execution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Can SqlContext be used inside mapPartitions
SQLContext lives on the driver side, and I don't think you can use it inside executors. How to provide lookup functionality in executors really depends on how you intend to use it.

Thanks.

Zhan Zhang

On Dec 22, 2015, at 4:44 PM, SRK wrote:

> Hi,
>
> Can SQLContext be used inside mapPartitions? My requirement is to register
> a set of data from HDFS as a temp table and to be able to look it up from
> inside mapPartitions based on a key. If it is not supported, is there a
> different way of doing this?
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-SqlContext-be-used-inside-mapPartitions-tp25771.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
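[Editor's note] A common alternative to calling SQLContext inside mapPartitions is to build the lookup table once on the driver and ship it to executors (in Spark, via sc.broadcast), then do the key lookup inside the partition function. Below is a plain-Python sketch of the idea; the names are hypothetical stand-ins, not the Spark API.

```python
# Lookup table built once on the driver (e.g. loaded from HDFS).
lookup = {"k1": 10, "k2": 20}

# Stand-in for a Broadcast[Map[...]]; in Spark this would be
# broadcast_lookup = sc.broadcast(lookup), read via .value on executors.
broadcast_lookup = {"value": lookup}

def process_partition(rows):
    # Executed on each executor; only reads the broadcast table.
    table = broadcast_lookup["value"]
    return [(key, table.get(key)) for key in rows]

result = process_partition(["k1", "k2", "missing"])
print(result)  # [('k1', 10), ('k2', 20), ('missing', None)]
```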
Re: spark-submit is ignoring "--executor-cores"
BTW: it is not only a YARN web UI issue. In the Capacity Scheduler, the vcore request is ignored by the default resource calculator. If you want YARN to honor vcore requests, you have to use the DominantResourceCalculator, as Saisai suggested.

Thanks.

Zhan Zhang

On Dec 21, 2015, at 5:30 PM, Saisai Shao wrote: and you'll see the right vcores y
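[Editor's note] For reference, the Capacity Scheduler setting Saisai refers to is, to the best of my knowledge, configured in capacity-scheduler.xml as follows:

```xml
<!-- Make the Capacity Scheduler account for vcores as well as memory. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```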
Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour
This looks to me like a very unusual use case: you stop the SparkContext and then start another one. I don't think that is well supported. Once the SparkContext is stopped, all its resources are supposed to be released. Is there a mandatory reason you have to stop and restart the SparkContext?

Thanks.

Zhan Zhang

Note that when sc is stopped, all resources are released (for example in yarn).

On Dec 20, 2015, at 2:59 PM, Jerry Lam wrote:

> Hi Spark developers,
>
> I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
> correctly when a different spark context is provided.
>
> ```
> val sc = new SparkContext
> val sqlContext = SQLContext.getOrCreate(sc)
> sc.stop
> ...
>
> val sc2 = new SparkContext
> val sqlContext2 = SQLContext.getOrCreate(sc2)
> sc2.stop
> ```
>
> The sqlContext2 will reference sc instead of sc2 and therefore the program
> will not work because sc has been stopped.
>
> Best Regards,
>
> Jerry

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
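[Editor's note] The reported behaviour is easy to reproduce with a minimal stand-in: a getOrCreate that caches the first instance and ignores later arguments. The class below is illustrative, not Spark's actual implementation.

```python
class FakeSQLContext:
    """Minimal stand-in for the cached-singleton pattern in the report."""
    _instance = None

    def __init__(self, sc):
        self.sc = sc

    @classmethod
    def get_or_create(cls, sc):
        if cls._instance is None:
            cls._instance = cls(sc)
        return cls._instance  # note: ignores `sc` once an instance is cached

sc1, sc2 = object(), object()           # stand-ins for two SparkContexts
ctx1 = FakeSQLContext.get_or_create(sc1)
ctx2 = FakeSQLContext.get_or_create(sc2)

# ctx2 still references sc1 -- the "stopped" context -- which is
# exactly the bug Jerry describes.
print(ctx2.sc is sc1)  # True
```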
Re: Spark with log4j
Hi Kalpesh,

If you are using Spark on YARN, this may not work, because you are writing logs to files other than stdout/stderr. As I understand it, YARN only aggregates logs written to stdout/stderr, and the local cache will be deleted (within a configured timeframe). To check, while the application is running you can log into the container's box and look in the container's local cache to find whether the log file exists (after the app terminates, these local cache files are deleted as well).

Thanks.

Zhan Zhang

On Dec 18, 2015, at 7:23 AM, Kalpesh Jadhav wrote:

Hi all, I am new to Spark and I am trying to use log4j for logging in my application, but the logs are not getting written to the specified file. I created the application using Maven and kept a log.properties file in the resources folder. The application is written in Scala. If there is an alternative to log4j that would also work, but I want to see the logs in a file. If any changes need to be made in Hortonworks for the Spark configuration, please mention that as well. If anyone has done this before, or if any source is available on GitHub, please respond.

Thanks, Kalpesh Jadhav
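[Editor's note] Following up on the YARN log-aggregation point above: one way to keep application logs visible to YARN's aggregation is to log to the console (stdout/stderr) instead of a separate file. A minimal log4j.properties along the lines of Spark's default template follows; treat it as a sketch rather than a drop-in guarantee for every setup.

```properties
# Send everything to the console so YARN's stdout/stderr capture
# (and hence log aggregation) picks it up.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```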
Re: number limit of map for spark
What I mean is to combine multiple map functions into one. I don't know exactly how your algorithm works. Does the result of one iteration depend on the previous iteration? If so, how? I think either you can optimize your implementation, or Spark is not the right fit for your specific application.

Thanks.

Zhan Zhang

On Dec 21, 2015, at 10:43 AM, Zhiliang Zhu wrote:

What is the difference between repartition / collect and collapse? Is collapse as costly as collect or repartition?

Thanks in advance ~

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.

Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I ran into the problem that Spark only accepts around 350 calls to map before it meets one action function; besides, dozens of actions obviously increase the run time. Is there any proper way? As tested, there is a piece of code as follows: ..
int count = 0;
JavaRDD<Integer> dataSet = jsc.parallelize(list, 1).cache(); // with only 1 partition
int m = 350;
JavaRDD<Integer> r = dataSet.cache();
JavaRDD<Integer> t = null;

// outer loop to temporarily convert the rdd r to t
for (int j = 0; j < m; ++j) {
    if (null != t) {
        r = t;
    }
    // inner loop calls map 350 times; if m is much more than 350 (for instance,
    // around 400), the job throws the exception "15/12/21 19:36:17 ERROR
    // yarn.ApplicationMaster: User class threw exception:
    // java.lang.StackOverflowError"
    for (int i = 0; i < m; ++i) {
        r = r.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        });
    }
    // then collect this rdd to build another rdd; however, dozens of actions
    // such as collect are very costly
    List<Integer> lt = r.collect();
    t = jsc.parallelize(lt, 1).cache();
}

Thanks very much in advance! Zhiliang
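[Editor's note] Zhan's suggestion to collapse the chained maps can be illustrated in a language-agnostic way: instead of chaining m map transformations, each of which lengthens the RDD lineage until it overflows the stack, apply a single map whose function performs the m steps internally. A plain-Python sketch of the idea follows (not the Spark API).

```python
import random

def monte_carlo_hit(_):
    # Same per-element work as the map function in the Java code above:
    # ignores its input and returns 1 if a random point lands in the
    # unit circle, else 0.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y < 1 else 0

def collapsed(value, iterations=350):
    # One function doing what 350 chained maps did, so only a single
    # map (one lineage step) is needed per element.
    for _ in range(iterations):
        value = monte_carlo_hit(value)
    return value

data = [0] * 100                       # stand-in for the 1-partition RDD
result = [collapsed(v) for v in data]  # one "map" instead of 350 chained ones
print(all(v in (0, 1) for v in result))  # True
```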
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849:
---
Release Note: User-configurable parameters for the HBase datasource are defined in org.apache.hadoop.hbase.spark.datasources.HBaseSparkConf; please refer to that class for details. Users can either set them in SparkConf, which takes effect globally, or configure them per table, which overrides the value set in SparkConf. If a parameter is not set, its default value takes effect. Currently three parameters are supported:
1. spark.hbase.blockcache.enable enables/disables the block cache. The default is enabled, but note that this may potentially slow down the system.
2. spark.hbase.cacheSize sets the cache size used when performing an HBase table scan. The default value is 1000.
3. spark.hbase.batchNum sets the batch number used when performing an HBase table scan. The default value is 1000.

> Add option to set block cache to false on SparkSQL executions
> -------------------------------------------------------------
>
> Key: HBASE-14849
> URL: https://issues.apache.org/jira/browse/HBASE-14849
> Project: HBase
> Issue Type: New Feature
> Components: spark
> Reporter: Ted Malaska
> Assignee: Zhan Zhang
> Attachments: HBASE-14849-1.patch, HBASE-14849-2.patch, HBASE-14849.patch
>
> I was working at a client with a ported-down version of the Spark module for
> HBase and realized we didn't add an option to turn off the block cache for
> the scans.
> At the client I just disabled all caching with Spark SQL; this is an easy but
> very impactful fix.
> The fix for this patch will make this configurable.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
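[Editor's note] The per-table-over-SparkConf precedence described in the release note can be sketched as follows. The parameter names come from the release note; the resolution helper itself is illustrative, not the connector's actual code.

```python
# Built-in defaults from the release note.
DEFAULTS = {
    "spark.hbase.blockcache.enable": True,
    "spark.hbase.cacheSize": 1000,
    "spark.hbase.batchNum": 1000,
}

def resolve(param, spark_conf, table_conf):
    """Per-table setting wins, then the global SparkConf, then the default."""
    if param in table_conf:
        return table_conf[param]
    if param in spark_conf:
        return spark_conf[param]
    return DEFAULTS[param]

spark_conf = {"spark.hbase.cacheSize": 500}   # global override
table_conf = {"spark.hbase.cacheSize": 100}   # per-table override

print(resolve("spark.hbase.cacheSize", spark_conf, table_conf))  # 100
print(resolve("spark.hbase.batchNum", spark_conf, table_conf))   # 1000
```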
[jira] [Commented] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064656#comment-15064656 ] Zhan Zhang commented on HBASE-14849: [~ted_yu] Not very familiar with it. Could you clarify which doc I need to update? Thanks. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska >Assignee: Zhan Zhang > Attachments: HBASE-14849-1.patch, HBASE-14849-2.patch, > HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: HBASE-14849-2.patch Solve review comments. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849-1.patch, HBASE-14849-2.patch, > HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: HBASE-14849-1.patch fix style check. The javadoc warning is not related to this jira. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849-1.patch, HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14991: --- Attachment: HBASE-14991-1.patch Attach the same file to kick off the testing > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991-1.patch, HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Spark big rdd problem
There are two cases here. If the container is killed by YARN, you can increase the JVM memory overhead. Otherwise, you have to increase executor-memory, assuming there is no memory leak.

Thanks.

Zhan Zhang

On Dec 15, 2015, at 9:58 PM, Eran Witkon wrote:

If the problem is containers trying to use more memory than they are allowed, how do I limit them? I already have executor-memory 5G.

Eran

On Tue, 15 Dec 2015 at 23:10 Zhan Zhang wrote:

You should be able to get the logs from YARN with "yarn logs -applicationId xxx", where you can possibly find the cause.

Thanks.

Zhan Zhang

On Dec 15, 2015, at 11:50 AM, Eran Witkon wrote:

> When running
> val data = sc.wholeTextFile("someDir/*") data.count()
>
> I get numerous warnings from YARN till I get an Akka association exception.
> Can someone explain what happens when Spark loads this RDD and can't fit it
> all in memory?
> Based on the exception it looks like the server is disconnecting from YARN
> and failing... Any idea why? The code is simple but still failing...
> Eran
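[Editor's note] As a rough sketch of the YARN-kill case above: the YARN container must accommodate executor-memory plus the off-heap JVM overhead. In Spark 1.x the overhead default was, to my recollection, max(384 MB, 10% of executor memory), settable via spark.yarn.executor.memoryOverhead; treat the exact figures as assumptions.

```python
def container_size_mb(executor_memory_mb, overhead_mb=None):
    """Approximate YARN container requirement for one executor.

    If the overhead is not set explicitly, use the (assumed Spark 1.x)
    default of max(384 MB, 10% of executor memory)."""
    if overhead_mb is None:
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb

# Eran's 5G executor: 5120 MB heap + 512 MB default overhead.
print(container_size_mb(5 * 1024))        # 5632
# Raising spark.yarn.executor.memoryOverhead to 1024 MB instead:
print(container_size_mb(5 * 1024, 1024))  # 6144
```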
[jira] [Commented] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059555#comment-15059555 ] Zhan Zhang commented on HBASE-14849:
I used the following command but didn't find any javadoc warnings. I will fix the other issues after gathering review comments.

mvn clean package javadoc:javadoc -DskipTests -DHBasePatchProcess

> Add option to set block cache to false on SparkSQL executions
> -------------------------------------------------------------
>
> Key: HBASE-14849
> URL: https://issues.apache.org/jira/browse/HBASE-14849
> Project: HBase
> Issue Type: New Feature
> Components: spark
> Reporter: Ted Malaska
> Assignee: Zhan Zhang
> Attachments: HBASE-14849.patch
>
> I was working at a client with a ported-down version of the Spark module for
> HBase and realized we didn't add an option to turn off the block cache for
> the scans.
> At the client I just disabled all caching with Spark SQL; this is an easy but
> very impactful fix.
> The fix for this patch will make this configurable.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059533#comment-15059533 ] Zhan Zhang commented on HBASE-14795: [~jmhsieh] HBASE-14991 is opened for this, and the patch is submitted. Thanks. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Fix For: 2.0.0 > > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14991: --- Status: Patch Available (was: Open) > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059531#comment-15059531 ] Zhan Zhang commented on HBASE-14991: Enable feature option and fix feature warning. > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug >Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059532#comment-15059532 ] Zhan Zhang commented on HBASE-14991: @Jonathan Hsieh Please review. > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug >Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14991: --- Attachment: HBASE-14991.patch > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14991) Fix the feature warning in scala code
Zhan Zhang created HBASE-14991: -- Summary: Fix the feature warning in scala code Key: HBASE-14991 URL: https://issues.apache.org/jira/browse/HBASE-14991 Project: HBase Issue Type: Bug Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14991) Fix the feature warning in scala code
Zhan Zhang created HBASE-14991: -- Summary: Fix the feature warning in scala code Key: HBASE-14991 URL: https://issues.apache.org/jira/browse/HBASE-14991 Project: HBase Issue Type: Bug Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059509#comment-15059509 ] Zhan Zhang commented on HBASE-14795: [~jmhsieh] I am trying to figure out how to "re-run with -feature for details". Do you know how to do it? > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska >Assignee: Zhan Zhang >Priority: Minor > Fix For: 2.0.0 > > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059502#comment-15059502 ] Zhan Zhang commented on HBASE-14795:
My mistake. These warnings are different from those three "feature warnings", but I don't know which build option to enable to get the details of these warnings.

> Enhance the spark-hbase scan operations
> ---------------------------------------
>
> Key: HBASE-14795
> URL: https://issues.apache.org/jira/browse/HBASE-14795
> Project: HBase
> Issue Type: Improvement
> Reporter: Ted Malaska
> Assignee: Zhan Zhang
> Priority: Minor
> Fix For: 2.0.0
>
> Attachments: 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, HBASE-14795-4.patch
>
> This is a sub-jira of HBASE-14789. This jira is to focus on the replacement
> of TableInputFormat with a more custom scan implementation that will make the
> following use case more effective.
> Use case:
> The case where you have multiple scan ranges on a single table within a
> single query. TableInputFormat will scan the outer range of the scan start
> and end range, where this implementation can be more pointed.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059498#comment-15059498 ] Zhan Zhang commented on HBASE-14795:
[~jmhsieh] My patch in HBASE-14849 will not fix those three warnings. But I checked again and think these warnings are not generated by the patch of this JIRA, because the patch does not touch any pom.xml file, while the warnings are actually a scoping issue in pom.xml. Correct me if I am wrong. Please see the following for details.

[WARNING] Artifact org.apache.spark:spark-core_2.10:jar:1.3.0:provided retains local artifactScope 'provided' overriding broader artifactScope 'compile' given by a dependency. If this is not intended, modify or remove the local artifactScope.
[WARNING] Artifact org.scala-lang:scala-library:jar:2.10.4:provided retains local artifactScope 'provided' overriding broader artifactScope 'compile' given by a dependency. If this is not intended, modify or remove the local artifactScope.
[WARNING] Artifact junit:junit:jar:4.12:test retains local artifactScope 'test' overriding broader artifactScope 'compile' given by a dependency. If this is not intended, modify or remove the local artifactScope.

> Enhance the spark-hbase scan operations
> ---------------------------------------
>
> Key: HBASE-14795
> URL: https://issues.apache.org/jira/browse/HBASE-14795
> Project: HBase
> Issue Type: Improvement
> Reporter: Ted Malaska
> Assignee: Zhan Zhang
> Priority: Minor
> Fix For: 2.0.0
>
> Attachments: 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, HBASE-14795-4.patch
>
> This is a sub-jira of HBASE-14789. This jira is to focus on the replacement
> of TableInputFormat with a more custom scan implementation that will make the
> following use case more effective.
> Use case:
> The case where you have multiple scan ranges on a single table within a
> single query. TableInputFormat will scan the outer range of the scan start
> and end range, where this implementation can be more pointed.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4445) Unify the term flowId and flowName in timeline v2 codebase
[ https://issues.apache.org/jira/browse/YARN-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated YARN-4445: - Attachment: YARN-4445-feature-YARN-2928.001.patch > Unify the term flowId and flowName in timeline v2 codebase > -- > > Key: YARN-4445 > URL: https://issues.apache.org/jira/browse/YARN-4445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu > Assignee: Zhan Zhang > Labels: refactor > Attachments: YARN-4445-feature-YARN-2928.001.patch, YARN-4445.patch > > > Flow names are not sufficient to identify a flow. I noticed we used both > "flowName" and "flowId" to point to the same thing. We need to unify them to > flowName. Otherwise, front end users may think flow id is a top level concept > and try to directly locate a flow by its flow id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4445) Unify the term flowId and flowName in timeline v2 codebase
[ https://issues.apache.org/jira/browse/YARN-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059212#comment-15059212 ] Zhan Zhang commented on YARN-4445: -- rename > Unify the term flowId and flowName in timeline v2 codebase > -- > > Key: YARN-4445 > URL: https://issues.apache.org/jira/browse/YARN-4445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Zhan Zhang > Labels: refactor > Attachments: YARN-4445-feature-YARN-2928.001.patch, YARN-4445.patch > > > Flow names are not sufficient to identify a flow. I noticed we used both > "flowName" and "flowId" to point to the same thing. We need to unify them to > flowName. Otherwise, front end users may think flow id is a top level concept > and try to directly locate a flow by its flow id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Status: Open (was: Patch Available) > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Status: Patch Available (was: Open) > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: HBASE-14849.patch > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: (was: HBASE-14849.patch) > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Status: Patch Available (was: Open) > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: HBASE-14849.patch Migrate hbase configuration to SparkConf, and some cleanup. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4445) Unify the term flowId and flowName in timeline v2 codebase
[ https://issues.apache.org/jira/browse/YARN-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated YARN-4445: - Attachment: YARN-4445.patch > Unify the term flowId and flowName in timeline v2 codebase > -- > > Key: YARN-4445 > URL: https://issues.apache.org/jira/browse/YARN-4445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu > Assignee: Zhan Zhang > Labels: refactor > Attachments: YARN-4445.patch > > > Flow names are not sufficient to identify a flow. I noticed we used both > "flowName" and "flowId" to point to the same thing. We need to unify them to > flowName. Otherwise, front end users may think flow id is a top level concept > and try to directly locate a flow by its flow id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058997#comment-15058997 ] Zhan Zhang commented on HBASE-14795: [~jmhsieh] Thanks for bringing this up. I am working on HBASE-14849, and also doing some cleanup work, which will also fix the following warnings: warning: there were 3 feature warning(s); re-run with -feature for details Regarding the warning below, it is a legacy one and HBASE-14159 is already open for it. warning: Class org.apache.hadoop.mapred.MiniMRCluster not found - continuing with a stub. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Fix For: 2.0.0 > > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table within a single > query. TableInputFormat will scan the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Spark big rdd problem
You should be able to get the logs from yarn by “yarn logs -applicationId xxx”, where you can possibly find the cause. Thanks. Zhan Zhang On Dec 15, 2015, at 11:50 AM, Eran Witkon wrote: > When running > val data = sc.wholeTextFile("someDir/*") data.count() > > I get numerous warnings from yarn till I get an Akka association exception. > Can someone explain what happens when spark loads this rdd and can't fit it > all in memory? > Based on the exception it looks like the server is disconnecting from yarn > and failing... Any idea why? The code is simple but still failing... > Eran - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: About Spark On Hbase
If you want DataFrame support, you can refer to https://github.com/zhzhan/shc, which I am working on integrating into HBase upstream alongside the existing support. Thanks. Zhan Zhang On Dec 15, 2015, at 4:34 AM, censj wrote: hi, fight fate Can I, inside the bulkPut() function, use Get to read a value first, then put that value to HBase? On Dec 9, 2015, at 16:02, censj wrote: Thank you! I know. On Dec 9, 2015, at 15:59, fightf...@163.com wrote: If you are using maven, you can add the cloudera maven repo to the repository in pom.xml and add the dependency of spark-hbase. I just found this: http://spark-packages.org/package/nerdammer/spark-hbase-connector As Feng Dongyu recommended, you can try this also, but I have no experience of using it. fightf...@163.com From: censj Sent: 2015-12-09 15:44 To: fightf...@163.com Cc: user@spark.apache.org Subject: Re: About Spark On Hbase So, how do I get this jar? I use an sbt package project and did not find it in the sbt libs. On Dec 9, 2015, at 15:42, fightf...@163.com wrote: I don't think it really needs CDH components. Just use the API. fightf...@163.com From: censj Sent: 2015-12-09 15:31 To: fightf...@163.com Cc: user@spark.apache.org Subject: Re: About Spark On Hbase But this depends on CDH. I have not installed CDH. On Dec 9, 2015, at 15:18, fightf...@163.com wrote: Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase Also, HBASE-13992<https://issues.apache.org/jira/browse/HBASE-13992> already integrates that feature into the hbase side, but that feature has not been released. Best, Sun. 
fightf...@163.com From: censj Date: 2015-12-09 15:04 To: user@spark.apache.org Subject: About Spark On Hbase hi all, I am using Spark now, but I have not found an open-source Spark-HBase integration. Can anyone tell me of one?
Re: Multi-core support per task in Spark
I noticed that it is configurable at the job level via spark.task.cpus. Is there any way to support it at the task level? Thanks. Zhan Zhang On Dec 11, 2015, at 10:46 AM, Zhan Zhang wrote: > Hi Folks, > > Is it possible to assign multiple cores per task, and how? Suppose we have some > scenario in which some tasks do really heavy processing of each record and > require multi-threading, and we want to avoid similar tasks being assigned to the > same executors/hosts. > > If it is not supported, does it make sense to add this feature? It may seem to > make users worry about more configuration, but by default we can still do 1 > core per task and only advanced users need to be aware of this feature. > > Thanks. > > Zhan Zhang > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > > - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?
I think you are fetching too many results to the driver. Typically, it is not recommended to collect much data to the driver. But if you have to, you can increase the driver memory when submitting jobs. Thanks. Zhan Zhang On Dec 11, 2015, at 6:14 AM, Tom Seddon wrote: I have a job that is running into intermittent errors with [SparkDriver] java.lang.OutOfMemoryError: Java heap space. Before I was getting this error I was getting errors saying the result size exceeded spark.driver.maxResultSize. This does not make any sense to me, as there are no actions in my job that send data to the driver - just a pull of data from S3, a map and reduceByKey and then conversion to dataframe and saveAsTable action that puts the results back on S3. I've found a few references to reduceByKey and spark.driver.maxResultSize having some importance, but cannot fathom how this setting could be related. Would greatly appreciate any advice. Thanks in advance, Tom
Re: Performance does not increase as the number of workers increasing in cluster mode
Not sure about your data and model size. But intuitively, there is a tradeoff between parallelism and network overhead. With the same data set and model, there is an optimum cluster size (performance may degrade at some point as the cluster size increases). You may want to test a larger data set if you want to do a performance benchmark. Thanks. Zhan Zhang On Dec 11, 2015, at 9:34 AM, Wei Da wrote: Hi, all I have done a test in different HW configurations of Spark 1.5.0. A KMeans algorithm was run in four different Spark environments: the first one ran in local mode, the other three ran in cluster mode, and all the nodes have the same CPU (6 cores) and memory (8G). The running times are recorded in the following. I thought the performance should increase as the number of workers increases. But the result shows no obvious improvement. Does anybody know the reason? Thanks a lot in advance! The number of rows in the test data is about 2.6 million; the input file is about 810M and is stored in HDFS. [X] Following is a snapshot of the Spark WebUI. [X] Wei Da Wei Da xwd0...@qq.com
Re: how to access local file from Spark sc.textFile("file:///path to/myfile")
As Sean mentioned, you cannot refer to a local file on your remote machines (executors). One workaround is to copy the file to all machines under the same directory. Thanks. Zhan Zhang On Dec 11, 2015, at 10:26 AM, Lin, Hao wrote: of the master node
Multi-core support per task in Spark
Hi Folks, Is it possible to assign multiple cores per task, and how? Suppose we have some scenario in which some tasks do really heavy processing of each record and require multi-threading, and we want to avoid similar tasks being assigned to the same executors/hosts. If it is not supported, does it make sense to add this feature? It may seem to make users worry about more configuration, but by default we can still do 1 core per task and only advanced users need to be aware of this feature. Thanks. Zhan Zhang - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Commented] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053169#comment-15053169 ] Zhan Zhang commented on HBASE-14849: [~ted.m] Please feel free to assign to me. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska >Assignee: Ted Malaska > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053168#comment-15053168 ] Zhan Zhang commented on HBASE-14849: I suggest putting this type of configuration into SparkConf, for example spark.hbase.blockcache.enable, and we can also migrate existing configurations in a similar way: spark.hbase.blockcache.size spark.hbase.batchnum I also have not thought of a good way to test it. One way is to create a new hbase default source dedicated to testing (with buildScan overridden), and based on the configuration we return different results to verify the configuration is correctly pushed. But that does not test the feature itself. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska >Assignee: Ted Malaska > > I was working at a client with a ported-down version of the Spark module for > HBase and realized we didn't add an option to turn off block cache for the > scans. > At the client I just disabled all caching with Spark SQL; this is an easy but > very impactful fix. > The fix for this patch will make this configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052024#comment-15052024 ] Zhan Zhang commented on HBASE-14795: Thanks [~ted.m] and [~ted_yu] for the help. [~ted.m]If you don't mind, please share some information regarding your testing and any issue you find. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska >Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051856#comment-15051856 ] Zhan Zhang commented on HBASE-14795: [~ted.m] Thanks for reviewing this. I have updated the reviewboard with context completion hook. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: HBASE-14795-4.patch > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: HBASE-14795-3.patch > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: HBASE-14795-2.patch > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046297#comment-15046297 ] Zhan Zhang commented on HBASE-14795: [~malaskat] I forgot to publish it. It is available now. Sorry about that. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046013#comment-15046013 ] Zhan Zhang commented on HBASE-14795: Thanks for reviewing it. I have updated the revised one in the reviewboard. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14789) Enhance the current spark-hbase connector
[ https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14789: --- Attachment: HBASE-14795-1.patch solve review comments. > Enhance the current spark-hbase connector > - > > Key: HBASE-14789 > URL: https://issues.apache.org/jira/browse/HBASE-14789 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: shc.pdf > > > This JIRA is to optimize the RDD construction in the current connector > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14789) Enhance the current spark-hbase connector
[ https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14789: --- Attachment: (was: HBASE-14795-1.patch) > Enhance the current spark-hbase connector > - > > Key: HBASE-14789 > URL: https://issues.apache.org/jira/browse/HBASE-14789 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: shc.pdf > > > This JIRA is to optimize the RDD construction in the current connector > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: HBASE-14795-1.patch solve review comments > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039729#comment-15039729 ] Zhan Zhang commented on HBASE-14795: Sure. I cannot submit review in review board, and will consult other people how to do this. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Status: Patch Available (was: Open) Initial patch to consolidate hbase-spark scan operations > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014494#comment-15014494 ] Zhan Zhang commented on HBASE-14795: [~malaskat] The work is in progress. I may send out the PR after the holiday, as I have to finish some other tasks in parallel. I can include 14849 in the PR, or you can go ahead adding the support or wait for a while. I am OK with all these options. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: DataFrames initial jdbc loading - will it be utilizing a filter predicate?
When you have the following query, 'account === "acct1" will be pushed down to generate a new query with “where account = 'acct1'”. Thanks. Zhan Zhang On Nov 18, 2015, at 11:36 AM, Eran Medan wrote: I understand that the following are equivalent df.filter('account === "acct1") sql("select * from tempTableName where account = 'acct1'") But is Spark SQL "smart" enough to also push filter predicates down for the initial load? e.g. sqlContext.read.jdbc(…).filter('account === "acct1") Is Spark "smart enough" to do this for each partition? ‘select … where account = ‘acct1’ AND (partition where clause here)? Or do I have to put it on each partition where clause, otherwise it will load the entire set and only then filter it in memory?
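Conceptually, "pushing the filter down" means the predicate is folded into the SQL string sent over JDBC, so only matching rows cross the network, instead of loading the whole table and filtering in memory. A toy illustration of that rewrite (plain Python, not Spark's actual Catalyst code; the function and table names are made up):

```python
def push_down(base_query: str, predicate: str) -> str:
    """Fold a filter predicate into the query sent to the database,
    so filtering happens on the database side rather than in memory."""
    return f"SELECT * FROM ({base_query}) t WHERE {predicate}"

# Rough equivalent of sqlContext.read.jdbc(...).filter('account === "acct1"):
rewritten = push_down("SELECT * FROM accounts", "account = 'acct1'")
print(rewritten)
# → SELECT * FROM (SELECT * FROM accounts) t WHERE account = 'acct1'

# For a partitioned JDBC read, the partition clause is ANDed in per partition:
partitioned = push_down("SELECT * FROM accounts",
                        "account = 'acct1' AND id BETWEEN 0 AND 999")
```

The point of the reply above is that Spark performs this rewrite automatically; the user does not need to repeat the predicate in each partition's where clause.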
[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005682#comment-15005682 ] Zhan Zhang commented on SPARK-11704: [~maropu] You are right. I mean fetching from network is a big overhead. Feel free to work on it. > Optimize the Cartesian Join > --- > > Key: SPARK-11704 > URL: https://issues.apache.org/jira/browse/SPARK-11704 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Zhan Zhang > > Currently CartesianProduct relies on RDD.cartesian, in which the computation > is realized as follows > override def compute(split: Partition, context: TaskContext): Iterator[(T, > U)] = { > val currSplit = split.asInstanceOf[CartesianPartition] > for (x <- rdd1.iterator(currSplit.s1, context); > y <- rdd2.iterator(currSplit.s2, context)) yield (x, y) > } > From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times. > Which is really heavy and may never finished if n is large, especially when > rdd2 is coming from ShuffleRDD. > We should have some optimization on CartesianProduct by caching rightResults. > The problem is that we don’t have cleanup hook to unpersist rightResults > AFAIK. I think we should have some cleanup hook after query execution. > With the hook available, we can easily optimize such Cartesian join. I > believe such cleanup hook may also benefit other query optimizations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005136#comment-15005136 ] Zhan Zhang edited comment on SPARK-11704 at 11/14/15 5:16 AM: -- I think we can add a register and a cleanup hook in the query context. Before the query is performed, the registered handlers are invoked (such as persist), and the cleanup hooks are invoked (e.g., unpersist) after the query is done. This way, in CartesianProduct we can cache the rightResult in the registered handler and unpersist it after the query, avoiding the recomputation of RDD2. In my testing, because rdd2 is quite small, I actually reversed the cartesian join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). This way the computation finishes quite fast, whereas the original form cannot finish. was (Author: zzhan): I think we can add a cleanup hook in SQLContext, and when the query is done, we invoke all the registered cleanup hooks. This way, in CartesianProduct we can cache the rightResult and register the cleanup handler (unpersist), avoiding the recomputation of RDD2. In my testing, because rdd2 is quite small, I actually reversed the cartesian join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). This way the computation finishes quite fast, whereas the original form cannot finish.
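The register/cleanup-hook proposal above is, in spirit, something like the following sketch. `QueryContext`, `register`, and `runQuery` are invented names for illustration, not Spark APIs: setup handlers (persist) run before the query body, cleanup handlers (unpersist) run after it, even if the query fails.

```scala
// Minimal sketch of register/cleanup hooks around a query.
object HookSketch {
  final case class Hooks(before: () => Unit, after: () => Unit)

  class QueryContext {
    private var hooks = List.empty[Hooks]

    def register(before: () => Unit, after: () => Unit): Unit =
      hooks ::= Hooks(before, after)

    // Run the query body with persist-style setup first and
    // unpersist-style cleanup afterwards, even on failure.
    def runQuery[T](body: => T): T = {
      hooks.foreach(_.before())
      try body finally hooks.foreach(_.after())
    }
  }

  def main(args: Array[String]): Unit = {
    val ctx = new QueryContext
    val log = scala.collection.mutable.Buffer.empty[String]
    ctx.register(() => log += "persist rightResult",
                 () => log += "unpersist rightResult")
    ctx.runQuery { log += "cartesian join" }
    println(log.mkString(", "))
    // prints: persist rightResult, cartesian join, unpersist rightResult
  }
}
```

The `try/finally` guarantees the unpersist-style cleanup runs once the query is done, which is the missing piece the comment identifies.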
[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005136#comment-15005136 ] Zhan Zhang commented on SPARK-11704: I think we can add a cleanup hook in SQLContext, and when the query is done, we invoke all the registered cleanup hooks. This way, in CartesianProduct we can cache the rightResult and register the cleanup handler (unpersist), avoiding the recomputation of RDD2. In my testing, because rdd2 is quite small, I actually reversed the cartesian join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). This way the computation finishes quite fast, whereas the original form cannot finish.
[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005134#comment-15005134 ] Zhan Zhang commented on SPARK-11704: [~maropu] Maybe I misunderstand. If RDD2 is coming from a ShuffleRDD, each new iterator will try to fetch from the network because RDD2 is not cached. Is the ShuffleRDD cached automatically?
[jira] [Commented] (SPARK-11705) Eliminate unnecessary Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004744#comment-15004744 ] Zhan Zhang commented on SPARK-11705: Simple reproduce steps:
import sqlContext.implicits._
case class SimpleRecord(key: Int, value: String)
def withDF(name: String) = {
  val df = sc.parallelize((0 until 10).map(x => SimpleRecord(x, s"record_$x"))).toDF()
  df.registerTempTable(name)
}
withDF("p")
withDF("s")
withDF("l")
val d = sqlContext.sql(s"select p.key, p.value, s.value, l.value from p, s, l where l.key = s.key and p.key = l.key")
d.queryExecution.sparkPlan
res15: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [key#0,value#1,value#3,value#5]
 SortMergeJoin [key#2,key#0], [key#4,key#4]
  CartesianProduct
   Scan PhysicalRDD[key#0,value#1]
   Scan PhysicalRDD[key#2,value#3]
  Scan PhysicalRDD[key#4,value#5]
val d1 = sqlContext.sql(s"select p.key, p.value, s.value, l.value from s, l, p where l.key = s.key and p.key = l.key")
d1.queryExecution.sparkPlan
res16: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [key#0,value#1,value#3,value#5]
 SortMergeJoin [key#4], [key#0]
  TungstenProject [key#4,value#5,value#3]
   SortMergeJoin [key#2], [key#4]
    Scan PhysicalRDD[key#2,value#3]
    Scan PhysicalRDD[key#4,value#5]
  Scan PhysicalRDD[key#0,value#1]
> Eliminate unnecessary Cartesian Join > > > Key: SPARK-11705 > URL: https://issues.apache.org/jira/browse/SPARK-11705 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Zhan Zhang > > When we have some queries similar to the following (I don't remember the exact > form): > select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 > = d.key3 > There will be a cartesian join between a and b. But if we simply change > the table order, for example to a, c, b, d, such a cartesian join is > eliminated. > Without such manual tuning, the query will never finish if a and b are big. But > we should not rely on such manual optimization.
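The manual reordering the comment demonstrates can be sketched as a tiny greedy heuristic: always pick as the next table one that shares an equi-join predicate with the tables already joined, so no step degenerates into a cartesian product. The table names and predicates below are illustrative only; this is not Spark's optimizer.

```scala
// Greedy join-order sketch: avoid cartesian steps by always joining a
// table that is connected by a predicate to the tables chosen so far.
object JoinReorderSketch {
  // Each predicate links two tables, e.g. ("a", "c") for a.key1 = c.key1.
  def reorder(tables: Seq[String], preds: Set[(String, String)]): Seq[String] = {
    def linked(t: String, chosen: Seq[String]): Boolean =
      chosen.exists(c => preds((t, c)) || preds((c, t)))

    val order = scala.collection.mutable.ArrayBuffer(tables.head)
    val remaining = scala.collection.mutable.ArrayBuffer(tables.tail: _*)
    while (remaining.nonEmpty) {
      // Prefer a table connected to the current join tree; falling back to
      // an arbitrary table would be a cartesian step.
      val next = remaining.find(t => linked(t, order.toSeq)).getOrElse(remaining.head)
      order += next
      remaining -= next
    }
    order.toSeq
  }

  def main(args: Array[String]): Unit = {
    // select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2
    //                            and c.key3 = d.key3
    val preds = Set(("a", "c"), ("b", "c"), ("c", "d"))
    println(reorder(Seq("a", "b", "c", "d"), preds))
    // a is joined with c before b, avoiding the a-b cartesian step
  }
}
```

For the query in the issue description the sketch produces the order a, c, b, d, which is exactly the manual rewrite that eliminates the cartesian join between a and b.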
[jira] [Created] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
Zhan Zhang created HBASE-14801: -- Summary: Enhance the Spark-HBase connector catalog with json format Key: HBASE-14801 URL: https://issues.apache.org/jira/browse/HBASE-14801 Project: HBase Issue Type: Improvement Reporter: Zhan Zhang Assignee: Zhan Zhang
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Summary: Enhance the spark-hbase scan operations (was: Provide an alternative spark-hbase SQL implementations for Scan) > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement > Reporter: Ted Malaska > Assignee: Zhan Zhang > Priority: Minor > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat with a more custom scan implementation that will make the > following use case more effective. > Use case: > When you have multiple scan ranges on a single table within a single > query, TableInputFormat will scan the outer range of the scan start and > end range, where this implementation can be more pointed.
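The use case above can be quantified with a small sketch: scanning the outer envelope of several row-key ranges touches every key in between, while one scan per range touches only the requested keys. Row keys are modeled as plain integers here purely to illustrate the trade-off; this is not the connector's implementation.

```scala
// Outer-envelope scan (TableInputFormat-style) vs. one scan per range.
object MultiRangeScanSketch {
  type KeyRange = (Int, Int) // inclusive start/end row keys

  // One scan from the smallest start to the largest end: the gaps
  // between requested ranges are read too.
  def outerScan(ranges: Seq[KeyRange]): Seq[Int] = {
    val lo = ranges.map(_._1).min
    val hi = ranges.map(_._2).max
    lo to hi
  }

  // Pointed scans: one scan per requested range, no wasted reads.
  def pointedScans(ranges: Seq[KeyRange]): Seq[Int] =
    ranges.flatMap { case (lo, hi) => lo to hi }

  def main(args: Array[String]): Unit = {
    val ranges = Seq((2, 3), (7, 8))
    println(s"outer scan reads ${outerScan(ranges).size} keys; " +
            s"pointed scans read ${pointedScans(ranges).size}")
    // prints: outer scan reads 7 keys; pointed scans read 4
  }
}
```

The wider the gap between the requested ranges, the more the outer-envelope scan over-reads, which is why a more pointed per-range scan helps this use case.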
[jira] [Updated] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14796: --- Summary: Enhance the Gets in the connector (was: Provide an alternative spark-hbase SQL implementations for Gets) > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement > Reporter: Ted Malaska > Assignee: Zhan Zhang > Priority: Minor > > Currently the Spark module's Spark SQL implementation gets records from HBase > on the driver if there is something like the following found in the SQL: > rowkey = 123 > The original reason for this was that normal SQL will not have many equality > operations in a single where clause. > Zhan had brought up two points that have value: > 1. The SQL may be generated and may have many, many equal statements in it, so > moving the work to an executor protects the driver from load. > 2. In the current implementation the driver is connecting to HBase, and > exceptions may cause trouble with the Spark application and not just with a > single task execution.
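Point 1 above, moving many equality Gets off the driver, can be sketched as batching the row keys and letting each batch be fetched by its own worker. `fetch` below is a hypothetical stand-in for an HBase Get, and the sequential `flatMap` stands in for per-partition execution on executors; none of this is the connector's actual API.

```scala
// Sketch: distribute many point-gets into per-"executor" batches instead
// of issuing them all from the driver.
object DistributedGetSketch {
  def distributeGets[K, V](keys: Seq[K], partitions: Int)(fetch: K => V): Seq[V] = {
    // Ceiling division so every key lands in one of `partitions` batches.
    val batchSize = math.max(1, (keys.size + partitions - 1) / partitions)
    keys.grouped(batchSize)
        .toSeq
        .flatMap(batch => batch.map(fetch)) // each batch = one executor's work
  }

  def main(args: Array[String]): Unit = {
    // fetch is faked as multiplication; in the real connector it would be
    // an HBase Get executed inside the task holding the batch.
    println(distributeGets(Seq(1, 2, 3, 4), 2)(_ * 10))
  }
}
```

Besides spreading the load, running `fetch` inside a task means an HBase exception fails only that task (which Spark can retry) rather than the driver.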
[jira] [Updated] (HBASE-14789) Enhance the current spark-hbase connector
[ https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14789: --- Description: This JIRA is to optimize the RDD construction in the current connector implementation. (was: This JIRA is to provide user an option to choose different Spark-HBase implementation based on requirements.) > Enhance the current spark-hbase connector > - > > Key: HBASE-14789 > URL: https://issues.apache.org/jira/browse/HBASE-14789 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: shc.pdf > > > This JIRA is to optimize the RDD construction in the current connector > implementation.
[jira] [Updated] (HBASE-14789) Enhance the current spark-hbase connector
[ https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14789: --- Summary: Enhance the current spark-hbase connector (was: Provide an alternative spark-hbase connector) > Enhance the current spark-hbase connector > - > > Key: HBASE-14789 > URL: https://issues.apache.org/jira/browse/HBASE-14789 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: shc.pdf > > > This JIRA is to provide users an option to choose a different Spark-HBase > implementation based on requirements.