[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: HBASE-14801-3.patch > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch, > HBASE-14801-3.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Status: Patch Available (was: In Progress) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Status: In Progress (was: Patch Available) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: HBASE-14801-2.patch > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: (was: HBASE-14801-2.patch) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: ORC file writing hangs in pyspark
Hi James, You can try writing with another format, e.g., Parquet, to see whether it is an ORC-specific issue or a more generic one. Thanks. Zhan Zhang On Feb 23, 2016, at 6:05 AM, James Barney <jamesbarne...@gmail.com> wrote: I'm trying to write an ORC file after running the FPGrowth algorithm on a dataset of around just 2GB in size. The algorithm performs well and can display results if I take(n) the freqItemSets() of the result after converting that to a DF. I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on Yarn. I get the results from querying a Hive table, also ORC format, running a number of maps, joins, and filters on the data. When the program attempts to write the files: result.write.orc('/data/staged/raw_result') size_1_buckets.write.orc('/data/staged/size_1_results') filter_size_2_buckets.write.orc('/data/staged/size_2_results') The first path, /data/staged/raw_result, is created with a _temporary folder, but the data is never written. The job hangs at this point, apparently indefinitely. Additionally, no logs are recorded or available for the jobs on the history server. What could be the problem?
[jira] [Commented] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159806#comment-15159806 ] Zhan Zhang commented on HBASE-14801: Will update the scoreboard after the sanity test by server. > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement >Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: HBASE-14801-2.patch > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems
[ https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125205#comment-15125205 ] Zhan Zhang commented on SPARK-7009: --- Yes. This one is obsoleted. > Build assembly JAR via ant to avoid zip64 problems > -- > > Key: SPARK-7009 > URL: https://issues.apache.org/jira/browse/SPARK-7009 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.0 > Environment: Java 7+ >Reporter: Steve Loughran > Attachments: check_spark_python.sh > > Original Estimate: 2h > Remaining Estimate: 2h > > SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a > format incompatible with Java and pyspark. > Provided the total number of .class files+resources is <64K, ant can be used > to make the final JAR instead, perhaps by unzipping the maven-generated JAR > then rezipping it with zip64=never, before publishing the artifact via maven. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118266#comment-15118266 ] Zhan Zhang commented on HBASE-14801: Looks like most of the warnings do not apply to this patch. I will update the patch after collecting more feedback. > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement >Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-11075) Spark SQL Thrift Server authentication issue on kerberized yarn cluster
[ https://issues.apache.org/jira/browse/SPARK-11075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113560#comment-15113560 ] Zhan Zhang commented on SPARK-11075: Duplicated to SPARK-5159? > Spark SQL Thrift Server authentication issue on kerberized yarn cluster > > > Key: SPARK-11075 > URL: https://issues.apache.org/jira/browse/SPARK-11075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0, 1.5.1 > Environment: hive-1.2.1 > hadoop-2.6.0 config kerbers >Reporter: Xiaoyu Wang > > Use proxy user connect to the thrift server by beeline but got permission > exception: > 1.Start the hive 1.2.1 metastore with user hive > {code} > $kinit -kt /tmp/hive.keytab hive/xxx > $nohup ./hive --service metastore 2>&1 >> ../logs/metastore.log & > {code} > 2.Start the spark thrift server with user hive > {code} > $kinit -kt /tmp/hive.keytab hive/xxx > $./start-thriftserver.sh --master yarn > {code} > 3.Connect to the thrift server with proxy user hive01 > {code} > $kinit hive01 > beeline command:!connect > jdbc:hive2://xxx:1/default;principal=hive/x...@hadoop.com;kerberosAuthType=kerberos;hive.server2.proxy.user=hive01 > {code} > 4.Create table and insert data > {code} > create table test(name string); > insert overwrite table test select * from sometable; > {code} > the insert sql got exception: > {noformat} > Error: org.apache.hadoop.security.AccessControlException: Permission denied: > user=hive01, access=WRITE, > inode="/user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1/_temporary/0/task_201510100917_0003_m_00":hive:hadoop:drwxr-xr-x > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238) > at > 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:182) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6512) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInternal(FSNamesystem.java:3805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInt(FSNamesystem.java:3775) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3739) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rename(NameNodeRpcServer.java:754) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.rename(ClientNamenodeProtocolServerSideTranslatorPB.java:565) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > (state=,code=0) > {noformat} > The table path on HDFS: > {noformat} > drwxrwxrwx - hive hadoop 0 2015-10-10 09:14 > /user/hive/warehouse/test > drwxrwxrwx - hive01 hadoop 0 2015-10-10 09:17 > /user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2 > drwxr-xr-x - hive01 hadoop 0 2015-10-10 09:17 > /user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1 > drwxr-xr-x - hive01 hadoop 0 2015-10-10 09:17 > 
/user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1/_temporary > drwxr-xr-x - hive01 hadoop 0 2015-10-10 09:17 > /user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1/_t
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113469#comment-15113469 ] Zhan Zhang commented on SPARK-5159: --- [~luciano resende] Given the current code base, I don't think impersonation works, unless I am missing something. In your case, you may want to verify who is accessing HDFS: is it the driver or an executor? You may retry the case the other way around (with the right permissions) to see whether the executor can access the file correctly. Currently, the driver does support impersonation if configured, but executors do not. > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Status: Patch Available (was: Open) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Attachment: HBASE-14801-1.patch > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102183#comment-15102183 ] Zhan Zhang edited comment on SPARK-5159 at 1/15/16 5:50 PM: What happen if an user have a valid visit to a table, which will be saved in catalog. Another user then also can visit the table as it is cached in local hivecatalog, even if the latter does not have the access to the table meta data, right? To make the impersonate to work, all the information has to be tagged by user, right? was (Author: zzhan): What happen if an user have a valid visit to a table, which will be saved in catalog. Another user then also can visit the table as it is cached in local hivecatalog, even if the latter does not have the access to the table, right? To make the impersonate to really work, all the information has to be tagged by user, right? > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102183#comment-15102183 ] Zhan Zhang commented on SPARK-5159: --- What happens if a user has valid access to a table, which is then saved in the catalog? Another user can also access the table, since it is cached in the local Hive catalog, even if the latter does not have access to the table, right? To make impersonation really work, all of this information has to be tagged by user, right? > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098734#comment-15098734 ] Zhan Zhang commented on SPARK-5159: --- This feature is definitely broken. But fixing it needs a complete design review first. For example, to enable impersonation (doAs) at runtime, how do we solve RDD sharing between different users? We can propagate the user to the executor piggybacked on the TaskDescription. But what happens if two users operate on two RDDs that share the same parent, with the cache created by another user? Currently, RDD scope is the SparkContext, without any user information. That means even if we do impersonation, it is meaningless, per my understanding. > Thrift server does not respect hive.server2.enable.doAs=true > > > Key: SPARK-5159 > URL: https://issues.apache.org/jira/browse/SPARK-5159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andrew Ray > Attachments: spark_thrift_server_log.txt > > > I'm currently testing the spark sql thrift server on a kerberos secured > cluster in YARN mode. Currently any user can access any table regardless of > HDFS permissions as all data is read as the hive user. In HiveServer2 the > property hive.server2.enable.doAs=true causes all access to be done as the > submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Review Request 42118: AMBARI-14601 Disable impersonation in spark hive support
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/42118/ --- Review request for Ambari and Robert Levas. Bugs: AMBARI-14601 https://issues.apache.org/jira/browse/AMBARI-14601 Repository: ambari Description --- Currently the Spark thrift server cannot do impersonation correctly. We have to disable this feature. Diffs - ambari-server/src/main/resources/stacks/HDP/2.3/services/SPARK/configuration/spark-hive-site-override.xml 8f0bc62 Diff: https://reviews.apache.org/r/42118/diff/ Testing --- Manual testing is done, and it works as expected. Without the patch, we hit the file permission issue; disabling impersonation fixes the issue, as below: at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) Thanks, Zhan Zhang
[jira] [Updated] (AMBARI-14601) Disable impersonation in spark
[ https://issues.apache.org/jira/browse/AMBARI-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated AMBARI-14601: Attachment: AMBARI-14601.patch set hive.server2.enable.doAs to false > Disable impersonation in spark > -- > > Key: AMBARI-14601 > URL: https://issues.apache.org/jira/browse/AMBARI-14601 > Project: Ambari > Issue Type: Bug > Reporter: Zhan Zhang > Attachments: AMBARI-14601.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AMBARI-14601) Disable impersonation in spark
[ https://issues.apache.org/jira/browse/AMBARI-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091161#comment-15091161 ] Zhan Zhang commented on AMBARI-14601: - Currently the Spark thrift server cannot do impersonation correctly. We have to disable this feature. > Disable impersonation in spark > -- > > Key: AMBARI-14601 > URL: https://issues.apache.org/jira/browse/AMBARI-14601 > Project: Ambari > Issue Type: Bug >Reporter: Zhan Zhang > Attachments: AMBARI-14601.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AMBARI-14601) Disable impersonation in spark
Zhan Zhang created AMBARI-14601: --- Summary: Disable impersonation in spark Key: AMBARI-14601 URL: https://issues.apache.org/jira/browse/AMBARI-14601 Project: Ambari Issue Type: Bug Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086414#comment-15086414 ] Zhan Zhang commented on HBASE-14801: I will start working on this. Please let me know if anyone has any concerns or comments. > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Improvement >Reporter: Zhan Zhang > Assignee: Zhan Zhang > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14796: --- Attachment: HBASE-14796-1.patch Addresses review comments > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14796-1.patch, HBASE-14976.patch > > > Currently the Spark-Module Spark SQL implementation gets records from HBase > from the driver if there is something like the following found in the SQL. > rowkey = 123 > The reason for this originally was that normal sql will not have many equal > operations in a single where clause. > Zhan had brought up two points that have value. > 1. The SQL may be generated and may have many many equal statements in it so > moving the work to an executor protects the driver from load > 2. In the current implementation the driver is connecting to HBase and > exceptions may cause trouble with the Spark application and not just with > a single task execution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Problem using limit clause in spark sql
There has to be a central point collecting exactly 10,000 records; the current approach uses a single partition, which is easy to implement. Otherwise, the driver would have to count the number of records in each partition and then decide how many records to materialize from each partition, because some partitions may not have enough records and some may even be empty. I don't see any straightforward workaround for this. Thanks. Zhan Zhang On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote: It is an application running as an http server. So I collect the data as the response. On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote: When you call collect() it will bring all the data to the driver. Do you mean to call persist() instead? From: tiandiwo...@icloud.com Subject: Problem using limit clause in spark sql Date: Wed, 23 Dec 2015 21:26:51 +0800 To: user@spark.apache.org Hi, I am using spark sql in a way like this: sqlContext.sql("select * from table limit 10000").map(...).collect() The problem is that the limit clause will collect all the 10,000 records into a single partition, resulting in the map afterwards running in only one partition and being really slow. I tried to use repartition, but it is kind of a waste to collect all those records into one partition, then shuffle them around, and then collect them again. Is there a way to work around this? BTW, there is no order by clause and I do not care which 10,000 records I get as long as the total number is less than or equal to 10,000.
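The coordination the reply describes can be sketched in plain Python, with lists standing in for partitions (no cluster required; the data and the `plan_limit` helper are made up for illustration):

```python
# Simulate "limit N" across partitions without funneling everything through
# one partition: the driver would first need per-partition record counts,
# then assign each partition a quota. Some partitions may be small or empty,
# which is exactly why a single-partition collect is the easy implementation.

def plan_limit(partition_sizes, n):
    """Decide how many records to take from each partition for `limit n`."""
    quotas = []
    remaining = n
    for size in partition_sizes:
        take = min(size, remaining)
        quotas.append(take)
        remaining -= take
    return quotas

partitions = [[1, 2, 3], [], [4, 5], [6, 7, 8, 9]]
quotas = plan_limit([len(p) for p in partitions], 5)
result = [x for part, q in zip(partitions, quotas) for x in part[:q]]
print(quotas)   # [3, 0, 2, 0]
print(result)   # [1, 2, 3, 4, 5]
```

The extra round trip to gather the counts is the cost the reply alludes to; the single-partition approach trades parallelism for avoiding it.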
[jira] [Commented] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070286#comment-15070286 ] Zhan Zhang commented on HBASE-14796: Thanks [~ted.m] for the quick review. It is reasonable to have a performance test, and I will try to grab a physical cluster for it. It may take some time, as I don't have a physical cluster for this. On the other hand, I do think we should change it to perform BulkGet in the executors regardless of the performance (although I think it should improve the performance rather than the other way around), because: 1. The current implementation does gather-scatter in the driver, which increases network overhead and latency if the number of gets is big. 2. Failure recovery. It is hard to do failure recovery when the work is performed in the driver, which is a single point of failure. The above two have been discussed in detail. But I just realized there is another potential issue: the current implementation may go against the Spark SQL engine design, as below. 3. Currently, the bulkGet happens in the query plan (buildScan), and the results stay in the driver (1st). The result is distributed to executors in query execution (2nd). 3.1 1st and 2nd do not always happen in pairs. Even worse, sometimes only 1st happens; for example, a user calls plan.explain but may never trigger the plan execution. 3.2 Memory taken by table.get may never get released in the driver, increasing the driver memory overhead. [~ted.m] Please let me know what you think, and correct me if my understanding is wrong. > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska >Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14976.patch > > > Currently the Spark-Module Spark SQL implementation gets records from HBase > from the driver if there is something like the following found in the SQL. > rowkey = 123 > The reason for this originally was that normal sql will not have many equal > operations in a single where clause. > Zhan had brought up two points that have value. > 1. The SQL may be generated and may have many many equal statements in it so > moving the work to an executor protects the driver from load > 2. In the current implementation the driver is connecting to HBase and > exceptions may cause trouble with the Spark application and not just with > a single task execution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
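The grouping idea behind executor-side BulkGet can be illustrated in plain Python (this is not the connector's actual code; the batch-size name just mirrors the `spark.hbase.bulkGetSize` setting mentioned in the release note, default 1000):

```python
# Group row keys into batches so each executor issues one multi-get RPC per
# batch instead of one RPC per key. Illustration only, not the connector's
# implementation; bulk_get_size mirrors spark.hbase.bulkGetSize (default 1000).

def group_gets(row_keys, bulk_get_size=1000):
    """Split row keys into batches of at most bulk_get_size."""
    return [row_keys[i:i + bulk_get_size]
            for i in range(0, len(row_keys), bulk_get_size)]

batches = group_gets(list(range(2500)), bulk_get_size=1000)
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Because each batch is formed and issued inside an executor task, a failed batch can be retried by re-running that task, which is the failure-recovery advantage the comment argues for.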
Re: Unable to create hive table using HiveContext
You are using embedded mode, which will create the db locally (in your case, the db may already have been created, but you do not have the right permissions?). To connect to a remote metastore, hive-site.xml has to be correctly configured. Thanks. Zhan Zhang On Dec 23, 2015, at 7:24 AM, Soni spark <soni2015.sp...@gmail.com> wrote: Hi friends, I am trying to create a hive table through spark with Java code in Eclipse using the code below. HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc()); sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)"); but i am getting error ERROR XBM0J: Directory /home/workspace4/Test/metastore_db already exists. I am not sure why the metastore is being created in the workspace. Please help me. Thanks Soniya
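A minimal hive-site.xml sketch for the remote-metastore configuration the reply refers to; the host and port are placeholders, not values from the thread:

```xml
<!-- Minimal hive-site.xml sketch: point the HiveContext at a remote
     metastore service instead of the embedded local Derby database.
     Host and port below are placeholders. -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>
```

With `hive.metastore.uris` set, Hive clients connect over Thrift instead of creating a `metastore_db` directory in the working directory.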
Re: DataFrameWriter.format(String) is there a list of options?
Now JSON, Parquet, ORC (in HiveContext), and text are natively supported. If you use Avro or other formats, you have to include the corresponding package, which is not built into the Spark jar. Thanks. Zhan Zhang On Dec 23, 2015, at 8:57 AM, Christopher Brady <christopher.br...@oracle.com> wrote: DataFrameWriter.format
[jira] [Updated] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14796: --- Release Note: spark.hbase.bulkGetSize in HBaseSparkConf controls the grouping of bulkGet operations; the default value is 1000. Status: Patch Available (was: Open) > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14976.patch > > > Currently the Spark-Module Spark SQL implementation gets records from HBase > from the driver if there is something like the following found in the SQL. > rowkey = 123 > The reason for this originally was that normal sql will not have many equal > operations in a single where clause. > Zhan had brought up two points that have value. > 1. The SQL may be generated and may have many many equal statements in it so > moving the work to an executor protects the driver from load > 2. In the current implementation the driver is connecting to HBase and > exceptions may cause trouble with the Spark application and not just with > a single task execution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14796: --- Attachment: HBASE-14976.patch We have use cases where a bulkGet may consist of thousands of gets. Moving BulkGet from the driver to the executor side will improve failure recovery, and potentially improve performance as well when the number of gets is big. > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14976.patch > > > Currently the Spark-Module Spark SQL implementation gets records from HBase > from the driver if there is something like the following found in the SQL. > rowkey = 123 > The reason for this originally was that normal sql will not have many equal > operations in a single where clause. > Zhan had brought up two points that have value. > 1. The SQL may be generated and may have many many equal statements in it so > moving the work to an executor protects the driver from load > 2. In the current implementation the driver is connecting to HBase and > exceptions may cause trouble with the Spark application and not just with > a single task execution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Can SqlContext be used inside mapPartitions
SQLContext lives on the driver side, and I don't think you can use it inside executors. How to provide lookup functionality in executors really depends on how you intend to use it.

Thanks.

Zhan Zhang

On Dec 22, 2015, at 4:44 PM, SRK wrote:

> Hi,
>
> Can SQLContext be used inside mapPartitions? My requirement is to register
> a set of data from HDFS as a temp table and to be able to look it up from
> inside mapPartitions based on a key. If it is not supported, is there a
> different way of doing this?
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-SqlContext-be-used-inside-mapPartitions-tp25771.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
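[Editor's note] A common alternative to calling SQLContext inside mapPartitions is to build the lookup table once on the driver and ship it to executors (in Spark, via sc.broadcast), then do the key lookup inside the partition function. Below is a plain-Python sketch of the idea; the names are hypothetical stand-ins, not the Spark API.

```python
# Lookup table built once on the driver (e.g. loaded from HDFS).
lookup = {"k1": 10, "k2": 20}

# Stand-in for a Broadcast[Map[...]]; in Spark this would be
# broadcast_lookup = sc.broadcast(lookup), read via .value on executors.
broadcast_lookup = {"value": lookup}

def process_partition(rows):
    # Executed on each executor; only reads the broadcast table.
    table = broadcast_lookup["value"]
    return [(key, table.get(key)) for key in rows]

result = process_partition(["k1", "k2", "missing"])
print(result)  # [('k1', 10), ('k2', 20), ('missing', None)]
```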
Re: spark-submit is ignoring "--executor-cores"
BTW: it is not only a YARN web UI issue. In the Capacity Scheduler, the vcore request is ignored by the default resource calculator. If you want YARN to honor vcore requests, you have to use the DominantResourceCalculator, as Saisai suggested.

Thanks.

Zhan Zhang

On Dec 21, 2015, at 5:30 PM, Saisai Shao wrote: and you'll see the right vcores y
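[Editor's note] For reference, the Capacity Scheduler setting Saisai refers to is, to the best of my knowledge, configured in capacity-scheduler.xml as follows:

```xml
<!-- Make the Capacity Scheduler account for vcores as well as memory. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```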
Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour
This looks to me like a very unusual use case: you stop the SparkContext and then start another one. I don't think that is well supported. Once the SparkContext is stopped, all its resources are supposed to be released. Is there a mandatory reason you have to stop and restart the SparkContext?

Thanks.

Zhan Zhang

Note that when sc is stopped, all resources are released (for example in yarn).

On Dec 20, 2015, at 2:59 PM, Jerry Lam wrote:

> Hi Spark developers,
>
> I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
> correctly when a different spark context is provided.
>
> ```
> val sc = new SparkContext
> val sqlContext = SQLContext.getOrCreate(sc)
> sc.stop
> ...
>
> val sc2 = new SparkContext
> val sqlContext2 = SQLContext.getOrCreate(sc2)
> sc2.stop
> ```
>
> The sqlContext2 will reference sc instead of sc2 and therefore the program
> will not work because sc has been stopped.
>
> Best Regards,
>
> Jerry

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
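[Editor's note] The reported behaviour is easy to reproduce with a minimal stand-in: a getOrCreate that caches the first instance and ignores later arguments. The class below is illustrative, not Spark's actual implementation.

```python
class FakeSQLContext:
    """Minimal stand-in for the cached-singleton pattern in the report."""
    _instance = None

    def __init__(self, sc):
        self.sc = sc

    @classmethod
    def get_or_create(cls, sc):
        if cls._instance is None:
            cls._instance = cls(sc)
        return cls._instance  # note: ignores `sc` once an instance is cached

sc1, sc2 = object(), object()           # stand-ins for two SparkContexts
ctx1 = FakeSQLContext.get_or_create(sc1)
ctx2 = FakeSQLContext.get_or_create(sc2)

# ctx2 still references sc1 -- the "stopped" context -- which is
# exactly the bug Jerry describes.
print(ctx2.sc is sc1)  # True
```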
Re: Spark with log4j
Hi Kalpesh,

If you are using Spark on YARN, this may not work, because you are writing logs to files other than stdout/stderr. As I understand it, YARN only aggregates logs written to stdout/stderr, and the local cache will be deleted (within a configured timeframe). To check, while the application is running you can log into the container's box and look in the container's local cache to find whether the log file exists (after the app terminates, these local cache files are deleted as well).

Thanks.

Zhan Zhang

On Dec 18, 2015, at 7:23 AM, Kalpesh Jadhav wrote:

Hi all, I am new to Spark and I am trying to use log4j for logging in my application, but the logs are not getting written to the specified file. I created the application using Maven and kept a log.properties file in the resources folder. The application is written in Scala. If there is an alternative to log4j that would also work, but I want to see the logs in a file. If any changes need to be made in Hortonworks for the Spark configuration, please mention that as well. If anyone has done this before, or if any source is available on GitHub, please respond.

Thanks, Kalpesh Jadhav
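[Editor's note] Following up on the YARN log-aggregation point above: one way to keep application logs visible to YARN's aggregation is to log to the console (stdout/stderr) instead of a separate file. A minimal log4j.properties along the lines of Spark's default template follows; treat it as a sketch rather than a drop-in guarantee for every setup.

```properties
# Send everything to the console so YARN's stdout/stderr capture
# (and hence log aggregation) picks it up.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```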
Re: number limit of map for spark
What I mean is to combine multiple map functions into one. I don't know exactly how your algorithm works. Does the result of one iteration depend on the previous iteration? If so, how? I think either you can optimize your implementation, or Spark is not the right fit for your specific application.

Thanks.

Zhan Zhang

On Dec 21, 2015, at 10:43 AM, Zhiliang Zhu wrote:

What is the difference between repartition / collect and collapse? Is collapse as costly as collect or repartition?

Thanks in advance ~

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.

Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I ran into the problem that Spark only accepts around 350 calls to map before it meets one action function; besides, dozens of actions obviously increase the run time. Is there any proper way? As tested, there is a piece of code as follows: ..
int count = 0;
JavaRDD<Integer> dataSet = jsc.parallelize(list, 1).cache(); // with only 1 partition
int m = 350;
JavaRDD<Integer> r = dataSet.cache();
JavaRDD<Integer> t = null;

// outer loop to temporarily convert the rdd r to t
for (int j = 0; j < m; ++j) {
    if (null != t) {
        r = t;
    }
    // inner loop calls map 350 times; if m is much more than 350 (for instance,
    // around 400), the job throws the exception "15/12/21 19:36:17 ERROR
    // yarn.ApplicationMaster: User class threw exception:
    // java.lang.StackOverflowError"
    for (int i = 0; i < m; ++i) {
        r = r.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        });
    }
    // then collect this rdd to build another rdd; however, dozens of actions
    // such as collect are very costly
    List<Integer> lt = r.collect();
    t = jsc.parallelize(lt, 1).cache();
}

Thanks very much in advance! Zhiliang
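[Editor's note] Zhan's suggestion to collapse the chained maps can be illustrated in a language-agnostic way: instead of chaining m map transformations, each of which lengthens the RDD lineage until it overflows the stack, apply a single map whose function performs the m steps internally. A plain-Python sketch of the idea follows (not the Spark API).

```python
import random

def monte_carlo_hit(_):
    # Same per-element work as the map function in the Java code above:
    # ignores its input and returns 1 if a random point lands in the
    # unit circle, else 0.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y < 1 else 0

def collapsed(value, iterations=350):
    # One function doing what 350 chained maps did, so only a single
    # map (one lineage step) is needed per element.
    for _ in range(iterations):
        value = monte_carlo_hit(value)
    return value

data = [0] * 100                       # stand-in for the 1-partition RDD
result = [collapsed(v) for v in data]  # one "map" instead of 350 chained ones
print(all(v in (0, 1) for v in result))  # True
```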
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849:
---
Release Note: User-configurable parameters for the HBase datasource are defined in org.apache.hadoop.hbase.spark.datasources.HBaseSparkConf; please refer to that class for details. Users can either set them in SparkConf, which takes effect globally, or configure them per table, which overrides the value set in SparkConf. If a parameter is not set, its default value takes effect. Currently three parameters are supported:
1. spark.hbase.blockcache.enable enables/disables the block cache. The default is enabled, but note that this may potentially slow down the system.
2. spark.hbase.cacheSize sets the cache size used when performing an HBase table scan. The default value is 1000.
3. spark.hbase.batchNum sets the batch number used when performing an HBase table scan. The default value is 1000.

> Add option to set block cache to false on SparkSQL executions
> -------------------------------------------------------------
>
> Key: HBASE-14849
> URL: https://issues.apache.org/jira/browse/HBASE-14849
> Project: HBase
> Issue Type: New Feature
> Components: spark
> Reporter: Ted Malaska
> Assignee: Zhan Zhang
> Attachments: HBASE-14849-1.patch, HBASE-14849-2.patch, HBASE-14849.patch
>
> I was working at a client with a ported-down version of the Spark module for
> HBase and realized we didn't add an option to turn off the block cache for
> the scans.
> At the client I just disabled all caching with Spark SQL; this is an easy but
> very impactful fix.
> The fix for this patch will make this configurable.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
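[Editor's note] The per-table-over-SparkConf precedence described in the release note can be sketched as follows. The parameter names come from the release note; the resolution helper itself is illustrative, not the connector's actual code.

```python
# Built-in defaults from the release note.
DEFAULTS = {
    "spark.hbase.blockcache.enable": True,
    "spark.hbase.cacheSize": 1000,
    "spark.hbase.batchNum": 1000,
}

def resolve(param, spark_conf, table_conf):
    """Per-table setting wins, then the global SparkConf, then the default."""
    if param in table_conf:
        return table_conf[param]
    if param in spark_conf:
        return spark_conf[param]
    return DEFAULTS[param]

spark_conf = {"spark.hbase.cacheSize": 500}   # global override
table_conf = {"spark.hbase.cacheSize": 100}   # per-table override

print(resolve("spark.hbase.cacheSize", spark_conf, table_conf))  # 100
print(resolve("spark.hbase.batchNum", spark_conf, table_conf))   # 1000
```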
[jira] [Commented] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064656#comment-15064656 ] Zhan Zhang commented on HBASE-14849: [~ted_yu] Not very familiar with it. Could you clarify which doc I need to update? Thanks. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska >Assignee: Zhan Zhang > Attachments: HBASE-14849-1.patch, HBASE-14849-2.patch, > HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: HBASE-14849-2.patch Solve review comments. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849-1.patch, HBASE-14849-2.patch, > HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: HBASE-14849-1.patch fix style check. The javadoc warning is not related to this jira. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849-1.patch, HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14991: --- Attachment: HBASE-14991-1.patch Attach the same file to kick off the testing > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991-1.patch, HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Spark big rdd problem
There are two cases here. If the container is killed by YARN, you can increase the JVM memory overhead. Otherwise, you have to increase executor-memory, assuming there is no memory leak.

Thanks.

Zhan Zhang

On Dec 15, 2015, at 9:58 PM, Eran Witkon wrote:

If the problem is containers trying to use more memory than they are allowed, how do I limit them? I already have executor-memory 5G.

Eran

On Tue, 15 Dec 2015 at 23:10 Zhan Zhang wrote:

You should be able to get the logs from YARN with "yarn logs -applicationId xxx", where you can possibly find the cause.

Thanks.

Zhan Zhang

On Dec 15, 2015, at 11:50 AM, Eran Witkon wrote:

> When running
> val data = sc.wholeTextFile("someDir/*") data.count()
>
> I get numerous warnings from YARN till I get an Akka association exception.
> Can someone explain what happens when Spark loads this RDD and can't fit it
> all in memory?
> Based on the exception it looks like the server is disconnecting from YARN
> and failing... Any idea why? The code is simple but still failing...
> Eran
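[Editor's note] As a rough sketch of the YARN-kill case above: the YARN container must accommodate executor-memory plus the off-heap JVM overhead. In Spark 1.x the overhead default was, to my recollection, max(384 MB, 10% of executor memory), settable via spark.yarn.executor.memoryOverhead; treat the exact figures as assumptions.

```python
def container_size_mb(executor_memory_mb, overhead_mb=None):
    """Approximate YARN container requirement for one executor.

    If the overhead is not set explicitly, use the (assumed Spark 1.x)
    default of max(384 MB, 10% of executor memory)."""
    if overhead_mb is None:
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb

# Eran's 5G executor: 5120 MB heap + 512 MB default overhead.
print(container_size_mb(5 * 1024))        # 5632
# Raising spark.yarn.executor.memoryOverhead to 1024 MB instead:
print(container_size_mb(5 * 1024, 1024))  # 6144
```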
[jira] [Commented] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059555#comment-15059555 ] Zhan Zhang commented on HBASE-14849:
I used the following command but didn't find any javadoc warnings. I will fix the other issues after gathering review comments.

mvn clean package javadoc:javadoc -DskipTests -DHBasePatchProcess

> Add option to set block cache to false on SparkSQL executions
> -------------------------------------------------------------
>
> Key: HBASE-14849
> URL: https://issues.apache.org/jira/browse/HBASE-14849
> Project: HBase
> Issue Type: New Feature
> Components: spark
> Reporter: Ted Malaska
> Assignee: Zhan Zhang
> Attachments: HBASE-14849.patch
>
> I was working at a client with a ported-down version of the Spark module for
> HBase and realized we didn't add an option to turn off the block cache for
> the scans.
> At the client I just disabled all caching with Spark SQL; this is an easy but
> very impactful fix.
> The fix for this patch will make this configurable.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059533#comment-15059533 ] Zhan Zhang commented on HBASE-14795: [~jmhsieh] HBASE-14991 is opened for this, and the patch is submitted. Thanks. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Fix For: 2.0.0 > > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14991: --- Status: Patch Available (was: Open) > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059531#comment-15059531 ] Zhan Zhang commented on HBASE-14991: Enable feature option and fix feature warning. > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug >Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059532#comment-15059532 ] Zhan Zhang commented on HBASE-14991: @Jonathan Hsieh Please review. > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug >Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14991) Fix the feature warning in scala code
[ https://issues.apache.org/jira/browse/HBASE-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14991: --- Attachment: HBASE-14991.patch > Fix the feature warning in scala code > - > > Key: HBASE-14991 > URL: https://issues.apache.org/jira/browse/HBASE-14991 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > Attachments: HBASE-14991.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14991) Fix the feature warning in scala code
Zhan Zhang created HBASE-14991: -- Summary: Fix the feature warning in scala code Key: HBASE-14991 URL: https://issues.apache.org/jira/browse/HBASE-14991 Project: HBase Issue Type: Bug Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14991) Fix the feature warning in scala code
Zhan Zhang created HBASE-14991: -- Summary: Fix the feature warning in scala code Key: HBASE-14991 URL: https://issues.apache.org/jira/browse/HBASE-14991 Project: HBase Issue Type: Bug Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059509#comment-15059509 ] Zhan Zhang commented on HBASE-14795: [~jmhsieh] I am trying to figure out how to "re-run with -feature for details". Do you know how to do it? > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska >Assignee: Zhan Zhang >Priority: Minor > Fix For: 2.0.0 > > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059502#comment-15059502 ] Zhan Zhang commented on HBASE-14795:
My mistake. These warnings are different from those three "feature warnings", but I don't know which build option to enable to get the details of these warnings.

> Enhance the spark-hbase scan operations
> ---------------------------------------
>
> Key: HBASE-14795
> URL: https://issues.apache.org/jira/browse/HBASE-14795
> Project: HBase
> Issue Type: Improvement
> Reporter: Ted Malaska
> Assignee: Zhan Zhang
> Priority: Minor
> Fix For: 2.0.0
>
> Attachments: 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, HBASE-14795-4.patch
>
> This is a sub-jira of HBASE-14789. This jira is to focus on the replacement
> of TableInputFormat with a more custom scan implementation that will make the
> following use case more effective.
> Use case:
> The case where you have multiple scan ranges on a single table within a
> single query. TableInputFormat will scan the outer range of the scan start
> and end range, where this implementation can be more pointed.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059498#comment-15059498 ] Zhan Zhang commented on HBASE-14795:
[~jmhsieh] My patch in HBASE-14849 will not fix those three warnings. But I checked again and think these warnings are not generated by the patch of this JIRA, because the patch does not touch any pom.xml file, while the warnings are actually a scoping issue in pom.xml. Correct me if I am wrong. Please see the following for details.

[WARNING] Artifact org.apache.spark:spark-core_2.10:jar:1.3.0:provided retains local artifactScope 'provided' overriding broader artifactScope 'compile' given by a dependency. If this is not intended, modify or remove the local artifactScope.
[WARNING] Artifact org.scala-lang:scala-library:jar:2.10.4:provided retains local artifactScope 'provided' overriding broader artifactScope 'compile' given by a dependency. If this is not intended, modify or remove the local artifactScope.
[WARNING] Artifact junit:junit:jar:4.12:test retains local artifactScope 'test' overriding broader artifactScope 'compile' given by a dependency. If this is not intended, modify or remove the local artifactScope.

> Enhance the spark-hbase scan operations
> ---------------------------------------
>
> Key: HBASE-14795
> URL: https://issues.apache.org/jira/browse/HBASE-14795
> Project: HBase
> Issue Type: Improvement
> Reporter: Ted Malaska
> Assignee: Zhan Zhang
> Priority: Minor
> Fix For: 2.0.0
>
> Attachments: 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, HBASE-14795-4.patch
>
> This is a sub-jira of HBASE-14789. This jira is to focus on the replacement
> of TableInputFormat with a more custom scan implementation that will make the
> following use case more effective.
> Use case:
> The case where you have multiple scan ranges on a single table within a
> single query. TableInputFormat will scan the outer range of the scan start
> and end range, where this implementation can be more pointed.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4445) Unify the term flowId and flowName in timeline v2 codebase
[ https://issues.apache.org/jira/browse/YARN-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated YARN-4445: - Attachment: YARN-4445-feature-YARN-2928.001.patch > Unify the term flowId and flowName in timeline v2 codebase > -- > > Key: YARN-4445 > URL: https://issues.apache.org/jira/browse/YARN-4445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu > Assignee: Zhan Zhang > Labels: refactor > Attachments: YARN-4445-feature-YARN-2928.001.patch, YARN-4445.patch > > > Flow names are not sufficient to identify a flow. I noticed we used both > "flowName" and "flowId" to point to the same thing. We need to unify them to > flowName. Otherwise, front end users may think flow id is a top level concept > and try to directly locate a flow by its flow id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4445) Unify the term flowId and flowName in timeline v2 codebase
[ https://issues.apache.org/jira/browse/YARN-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059212#comment-15059212 ] Zhan Zhang commented on YARN-4445: -- rename > Unify the term flowId and flowName in timeline v2 codebase > -- > > Key: YARN-4445 > URL: https://issues.apache.org/jira/browse/YARN-4445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Zhan Zhang > Labels: refactor > Attachments: YARN-4445-feature-YARN-2928.001.patch, YARN-4445.patch > > > Flow names are not sufficient to identify a flow. I noticed we used both > "flowName" and "flowId" to point to the same thing. We need to unify them to > flowName. Otherwise, front end users may think flow id is a top level concept > and try to directly locate a flow by its flow id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Status: Open (was: Patch Available) > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Status: Patch Available (was: Open) > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: HBASE-14849.patch > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: (was: HBASE-14849.patch) > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Status: Patch Available (was: Open) > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14849: --- Attachment: HBASE-14849.patch Migrate hbase configuration to SparkConf, and some cleanup. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska > Assignee: Zhan Zhang > Attachments: HBASE-14849.patch > > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4445) Unify the term flowId and flowName in timeline v2 codebase
[ https://issues.apache.org/jira/browse/YARN-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated YARN-4445: - Attachment: YARN-4445.patch > Unify the term flowId and flowName in timeline v2 codebase > -- > > Key: YARN-4445 > URL: https://issues.apache.org/jira/browse/YARN-4445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu > Assignee: Zhan Zhang > Labels: refactor > Attachments: YARN-4445.patch > > > Flow names are not sufficient to identify a flow. I noticed we used both > "flowName" and "flowId" to point to the same thing. We need to unify them to > flowName. Otherwise, front end users may think flow id is a top level concept > and try to directly locate a flow by its flow id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058997#comment-15058997 ] Zhan Zhang commented on HBASE-14795: [~jmhsieh] Thanks for bringing this up. I am working on HBASE-14849, and also doing some cleanup work, which will also fix the following warnings: warning: there were 3 feature warning(s); re-run with -feature for details Regarding the warning below, it is a legacy one and HBASE-14159 is already open for it. warning: Class org.apache.hadoop.mapred.MiniMRCluster not found - continuing with a stub. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Fix For: 2.0.0 > > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table within a single > query. TableInputFormat will scan the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Spark big rdd problem
You should be able to get the logs from yarn by “yarn logs -applicationId xxx”, where you can possibly find the cause. Thanks. Zhan Zhang On Dec 15, 2015, at 11:50 AM, Eran Witkon wrote: > When running > val data = sc.wholeTextFile("someDir/*") data.count() > > I get numerous warnings from yarn till I get an Akka association exception. > Can someone explain what happens when spark loads this rdd and can't fit it > all in memory? > Based on the exception it looks like the server is disconnecting from yarn > and failing... Any idea why? The code is simple but still failing... > Eran - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: About Spark On Hbase
If you want DataFrame support, you can refer to https://github.com/zhzhan/shc, which I am working on integrating into HBase upstream alongside the existing support. Thanks. Zhan Zhang On Dec 15, 2015, at 4:34 AM, censj wrote: hi, fight fate Can I, inside the bulkPut() function, use Get to read a value first, then put that value to HBase? On Dec 9, 2015, at 16:02, censj wrote: Thank you! I know. On Dec 9, 2015, at 15:59, fightf...@163.com wrote: If you are using maven, you can add the cloudera maven repo to the repository in pom.xml and add the dependency of spark-hbase. I just found this: http://spark-packages.org/package/nerdammer/spark-hbase-connector As Feng Dongyu recommended, you can try this also, but I have no experience of using it. fightf...@163.com From: censj Sent: 2015-12-09 15:44 To: fightf...@163.com Cc: user@spark.apache.org Subject: Re: About Spark On Hbase So, how do I get this jar? I use an sbt package project and did not find it in the sbt libs. On Dec 9, 2015, at 15:42, fightf...@163.com wrote: I don't think it really needs CDH components. Just use the API. fightf...@163.com From: censj Sent: 2015-12-09 15:31 To: fightf...@163.com Cc: user@spark.apache.org Subject: Re: About Spark On Hbase But this depends on CDH. I have not installed CDH. On Dec 9, 2015, at 15:18, fightf...@163.com wrote: Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase Also, HBASE-13992<https://issues.apache.org/jira/browse/HBASE-13992> already integrates that feature into the hbase side, but that feature has not been released. Best, Sun. 
fightf...@163.com From: censj Date: 2015-12-09 15:04 To: user@spark.apache.org Subject: About Spark On Hbase hi all, I am using Spark now, but I have not found an open-source Spark-HBase integration. Can anyone tell me of one?
Re: Multi-core support per task in Spark
I noticed that it is configurable at the job level via spark.task.cpus. Is there any way to support it at the task level? Thanks. Zhan Zhang On Dec 11, 2015, at 10:46 AM, Zhan Zhang wrote: > Hi Folks, > > Is it possible to assign multiple cores per task, and how? Suppose we have some > scenario in which some tasks do really heavy processing of each record and > require multi-threading, and we want to avoid similar tasks being assigned to the > same executors/hosts. > > If it is not supported, does it make sense to add this feature? It may seem to > make users worry about more configuration, but by default we can still do 1 > core per task and only advanced users need to be aware of this feature. > > Thanks. > > Zhan Zhang > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > > - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?
I think you are fetching too many results to the driver. Typically, it is not recommended to collect much data to the driver. But if you have to, you can increase the driver memory when submitting jobs. Thanks. Zhan Zhang On Dec 11, 2015, at 6:14 AM, Tom Seddon wrote: I have a job that is running into intermittent errors with [SparkDriver] java.lang.OutOfMemoryError: Java heap space. Before I was getting this error I was getting errors saying the result size exceeded spark.driver.maxResultSize. This does not make any sense to me, as there are no actions in my job that send data to the driver - just a pull of data from S3, a map and reduceByKey and then conversion to dataframe and saveAsTable action that puts the results back on S3. I've found a few references to reduceByKey and spark.driver.maxResultSize having some importance, but cannot fathom how this setting could be related. Would greatly appreciate any advice. Thanks in advance, Tom
Re: Performance does not increase as the number of workers increasing in cluster mode
Not sure about your data and model size. But intuitively, there is a tradeoff between parallelism and network overhead. With the same data set and model, there is an optimum cluster size (performance may degrade at some point as the cluster size increases). You may want to test a larger data set if you want to do a performance benchmark. Thanks. Zhan Zhang On Dec 11, 2015, at 9:34 AM, Wei Da wrote: Hi, all I have done a test in different HW configurations of Spark 1.5.0. A KMeans algorithm was run in four different Spark environments: the first one ran in local mode, the other three ran in cluster mode, and all the nodes have the same CPU (6 cores) and memory (8G). The running times are recorded in the following. I thought the performance should increase as the number of workers increases. But the result shows no obvious improvement. Does anybody know the reason? Thanks a lot in advance! The number of rows in the test data is about 2.6 million; the input file is about 810M and is stored in HDFS. [X] Following is a snapshot of the Spark WebUI. [X] Wei Da Wei Da xwd0...@qq.com
Re: how to access local file from Spark sc.textFile("file:///path to/myfile")
As Sean mentioned, you cannot refer to a local file on your remote machines (executors). One workaround is to copy the file to all machines under the same directory. Thanks. Zhan Zhang On Dec 11, 2015, at 10:26 AM, Lin, Hao wrote: of the master node
Multi-core support per task in Spark
Hi Folks, Is it possible to assign multiple cores per task, and how? Suppose we have some scenario in which some tasks do really heavy processing of each record and require multi-threading, and we want to avoid similar tasks being assigned to the same executors/hosts. If it is not supported, does it make sense to add this feature? It may seem to make users worry about more configuration, but by default we can still do 1 core per task and only advanced users need to be aware of this feature. Thanks. Zhan Zhang - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Commented] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053169#comment-15053169 ] Zhan Zhang commented on HBASE-14849: [~ted.m] Please feel free to assign to me. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska >Assignee: Ted Malaska > > I was working at a client with a ported down version of the Spark module for > HBase and realized we didn't add an option to turn of block cache for the > scans. > At the client I just disabled all caching with Spark SQL, this is an easy but > very impactful fix. > The fix for this patch will make this configurable -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14849) Add option to set block cache to false on SparkSQL executions
[ https://issues.apache.org/jira/browse/HBASE-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053168#comment-15053168 ] Zhan Zhang commented on HBASE-14849: I suggest putting this type of configuration into SparkConf, for example spark.hbase.blockcache.enable, and we can also migrate existing configurations in a similar way: spark.hbase.blockcache.size spark.hbase.batchnum I also have not thought of a good way to test it. One way is to create a new hbase default source dedicated to testing (with buildScan overridden), and based on the configuration we return different results to verify the configuration is correctly pushed. But that does not test the feature itself. > Add option to set block cache to false on SparkSQL executions > - > > Key: HBASE-14849 > URL: https://issues.apache.org/jira/browse/HBASE-14849 > Project: HBase > Issue Type: New Feature > Components: spark >Reporter: Ted Malaska >Assignee: Ted Malaska > > I was working at a client with a ported-down version of the Spark module for > HBase and realized we didn't add an option to turn off block cache for the > scans. > At the client I just disabled all caching with Spark SQL; this is an easy but > very impactful fix. > The fix for this patch will make this configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052024#comment-15052024 ] Zhan Zhang commented on HBASE-14795: Thanks [~ted.m] and [~ted_yu] for the help. [~ted.m]If you don't mind, please share some information regarding your testing and any issue you find. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska >Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051856#comment-15051856 ] Zhan Zhang commented on HBASE-14795: [~ted.m] Thanks for reviewing this. I have updated the reviewboard with context completion hook. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: HBASE-14795-4.patch > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch, > HBASE-14795-4.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: HBASE-14795-3.patch > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch, HBASE-14795-3.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: HBASE-14795-2.patch > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch, HBASE-14795-2.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046297#comment-15046297 ] Zhan Zhang commented on HBASE-14795: [~malaskat] I forgot to publish it. It is available now. Sorry about that. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046013#comment-15046013 ] Zhan Zhang commented on HBASE-14795: Thanks for reviewing it. I have updated the revised one in the reviewboard. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14789) Enhance the current spark-hbase connector
[ https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14789: --- Attachment: HBASE-14795-1.patch solve review comments. > Enhance the current spark-hbase connector > - > > Key: HBASE-14789 > URL: https://issues.apache.org/jira/browse/HBASE-14789 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: shc.pdf > > > This JIRA is to optimize the RDD construction in the current connector > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14789) Enhance the current spark-hbase connector
[ https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14789: --- Attachment: (was: HBASE-14795-1.patch) > Enhance the current spark-hbase connector > - > > Key: HBASE-14789 > URL: https://issues.apache.org/jira/browse/HBASE-14789 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: shc.pdf > > > This JIRA is to optimize the RDD construction in the current connector > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: HBASE-14795-1.patch solve review comments > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch, > HBASE-14795-1.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039729#comment-15039729 ] Zhan Zhang commented on HBASE-14795: Sure. I cannot submit review in review board, and will consult other people how to do this. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Attachment: 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Status: Patch Available (was: Open) Initial patch to consolidate hbase-spark scan operations > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > Attachments: > 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch > > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014494#comment-15014494 ] Zhan Zhang commented on HBASE-14795: [~malaskat] The work is in progress. I may send out the PR after the holiday, as I have to finish some other tasks in parallel. I can include 14849 in the PR, or you can go ahead adding the support or wait for a while. I am OK with all these options. > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement >Reporter: Ted Malaska > Assignee: Zhan Zhang >Priority: Minor > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat for a more custom scan implementation that will make the > following use case more effective. > Use case: > In the case you have multiple scan ranges on a single table with in a single > query. TableInputFormat will scan the the outer range of the scan start and > end range where this implementation can be more pointed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: DataFrames initial jdbc loading - will it be utilizing a filter predicate?
When you have the following query, 'account === "acct1" will be pushed down to generate a new query with “where account = 'acct1'”. Thanks. Zhan Zhang On Nov 18, 2015, at 11:36 AM, Eran Medan wrote: I understand that the following are equivalent df.filter('account === "acct1") sql("select * from tempTableName where account = 'acct1'") But is Spark SQL "smart" enough to also push filter predicates down for the initial load? e.g. sqlContext.read.jdbc(…).filter('account === "acct1") Is Spark "smart enough" to do this for each partition? ‘select … where account = ‘acct1’ AND (partition where clause here)? Or do I have to put it on each partition where clause, otherwise it will load the entire set and only then filter it in memory?
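Conceptually, "pushing the filter down" means the predicate is folded into the SQL string sent over JDBC, so only matching rows cross the network, instead of loading the whole table and filtering in memory. A toy illustration of that rewrite (plain Python, not Spark's actual Catalyst code; the function and table names are made up):

```python
def push_down(base_query: str, predicate: str) -> str:
    """Fold a filter predicate into the query sent to the database,
    so filtering happens on the database side rather than in memory."""
    return f"SELECT * FROM ({base_query}) t WHERE {predicate}"

# Rough equivalent of sqlContext.read.jdbc(...).filter('account === "acct1"):
rewritten = push_down("SELECT * FROM accounts", "account = 'acct1'")
print(rewritten)
# → SELECT * FROM (SELECT * FROM accounts) t WHERE account = 'acct1'

# For a partitioned JDBC read, the partition clause is ANDed in per partition:
partitioned = push_down("SELECT * FROM accounts",
                        "account = 'acct1' AND id BETWEEN 0 AND 999")
```

The point of the reply above is that Spark performs this rewrite automatically; the user does not need to repeat the predicate in each partition's where clause.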
[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005682#comment-15005682 ] Zhan Zhang commented on SPARK-11704: [~maropu] You are right. I mean fetching from network is a big overhead. Feel free to work on it. > Optimize the Cartesian Join > --- > > Key: SPARK-11704 > URL: https://issues.apache.org/jira/browse/SPARK-11704 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Zhan Zhang > > Currently CartesianProduct relies on RDD.cartesian, in which the computation > is realized as follows > override def compute(split: Partition, context: TaskContext): Iterator[(T, > U)] = { > val currSplit = split.asInstanceOf[CartesianPartition] > for (x <- rdd1.iterator(currSplit.s1, context); > y <- rdd2.iterator(currSplit.s2, context)) yield (x, y) > } > From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times. > Which is really heavy and may never finished if n is large, especially when > rdd2 is coming from ShuffleRDD. > We should have some optimization on CartesianProduct by caching rightResults. > The problem is that we don’t have cleanup hook to unpersist rightResults > AFAIK. I think we should have some cleanup hook after query execution. > With the hook available, we can easily optimize such Cartesian join. I > believe such cleanup hook may also benefit other query optimizations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005136#comment-15005136 ] Zhan Zhang edited comment on SPARK-11704 at 11/14/15 5:16 AM: -- I think we can add a register and a cleanup hook in the query context. Before the query is performed, the registered handlers are invoked (such as persist), and the cleanup hooks are invoked (e.g., unpersist) after the query is done. This way, in CartesianProduct we can cache the rightResult in the registered handler and unpersist it after the query, avoiding the recomputation of RDD2. In my testing, because rdd2 is quite small, I actually reversed the cartesian join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). This way the computation finishes quite fast, whereas the original form cannot finish. was (Author: zzhan): I think we can add a cleanup hook in SQLContext, and when the query is done, we invoke all the registered cleanup hooks. This way, in CartesianProduct we can cache the rightResult and register the cleanup handler (unpersist), avoiding the recomputation of RDD2. In my testing, because rdd2 is quite small, I actually reversed the cartesian join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). This way the computation finishes quite fast, whereas the original form cannot finish.
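The register/cleanup-hook proposal above is, in spirit, something like the following sketch. `QueryContext`, `register`, and `runQuery` are invented names for illustration, not Spark APIs: setup handlers (persist) run before the query body, cleanup handlers (unpersist) run after it, even if the query fails.

```scala
// Minimal sketch of register/cleanup hooks around a query.
object HookSketch {
  final case class Hooks(before: () => Unit, after: () => Unit)

  class QueryContext {
    private var hooks = List.empty[Hooks]

    def register(before: () => Unit, after: () => Unit): Unit =
      hooks ::= Hooks(before, after)

    // Run the query body with persist-style setup first and
    // unpersist-style cleanup afterwards, even on failure.
    def runQuery[T](body: => T): T = {
      hooks.foreach(_.before())
      try body finally hooks.foreach(_.after())
    }
  }

  def main(args: Array[String]): Unit = {
    val ctx = new QueryContext
    val log = scala.collection.mutable.Buffer.empty[String]
    ctx.register(() => log += "persist rightResult",
                 () => log += "unpersist rightResult")
    ctx.runQuery { log += "cartesian join" }
    println(log.mkString(", "))
    // prints: persist rightResult, cartesian join, unpersist rightResult
  }
}
```

The `try/finally` guarantees the unpersist-style cleanup runs once the query is done, which is the missing piece the comment identifies.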
[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005136#comment-15005136 ] Zhan Zhang commented on SPARK-11704: I think we can add a cleanup hook in SQLContext, and when the query is done, we invoke all the registered cleanup hooks. This way, in CartesianProduct we can cache the rightResult and register the cleanup handler (unpersist), avoiding the recomputation of RDD2. In my testing, because rdd2 is quite small, I actually reversed the cartesian join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). This way the computation finishes quite fast, whereas the original form cannot finish.
[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005134#comment-15005134 ] Zhan Zhang commented on SPARK-11704: [~maropu] Maybe I misunderstand. If RDD2 is coming from a ShuffleRDD, each new iterator will try to fetch from the network because RDD2 is not cached. Is the ShuffleRDD cached automatically?
[jira] [Commented] (SPARK-11705) Eliminate unnecessary Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004744#comment-15004744 ] Zhan Zhang commented on SPARK-11705: Simple reproduce steps:
import sqlContext.implicits._
case class SimpleRecord(key: Int, value: String)
def withDF(name: String) = {
  val df = sc.parallelize((0 until 10).map(x => SimpleRecord(x, s"record_$x"))).toDF()
  df.registerTempTable(name)
}
withDF("p")
withDF("s")
withDF("l")
val d = sqlContext.sql(s"select p.key, p.value, s.value, l.value from p, s, l where l.key = s.key and p.key = l.key")
d.queryExecution.sparkPlan
res15: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [key#0,value#1,value#3,value#5]
 SortMergeJoin [key#2,key#0], [key#4,key#4]
  CartesianProduct
   Scan PhysicalRDD[key#0,value#1]
   Scan PhysicalRDD[key#2,value#3]
  Scan PhysicalRDD[key#4,value#5]
val d1 = sqlContext.sql(s"select p.key, p.value, s.value, l.value from s, l, p where l.key = s.key and p.key = l.key")
d1.queryExecution.sparkPlan
res16: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [key#0,value#1,value#3,value#5]
 SortMergeJoin [key#4], [key#0]
  TungstenProject [key#4,value#5,value#3]
   SortMergeJoin [key#2], [key#4]
    Scan PhysicalRDD[key#2,value#3]
    Scan PhysicalRDD[key#4,value#5]
  Scan PhysicalRDD[key#0,value#1]
> Eliminate unnecessary Cartesian Join > > > Key: SPARK-11705 > URL: https://issues.apache.org/jira/browse/SPARK-11705 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Zhan Zhang > > When we have some queries similar to the following (I don't remember the exact > form): > select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 > = d.key3 > There will be a cartesian join between a and b. But if we simply change > the table order, for example to a, c, b, d, such a cartesian join is > eliminated. > Without such manual tuning, the query will never finish if a and b are big. But > we should not rely on such manual optimization.
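The manual reordering the comment demonstrates can be sketched as a tiny greedy heuristic: always pick as the next table one that shares an equi-join predicate with the tables already joined, so no step degenerates into a cartesian product. The table names and predicates below are illustrative only; this is not Spark's optimizer.

```scala
// Greedy join-order sketch: avoid cartesian steps by always joining a
// table that is connected by a predicate to the tables chosen so far.
object JoinReorderSketch {
  // Each predicate links two tables, e.g. ("a", "c") for a.key1 = c.key1.
  def reorder(tables: Seq[String], preds: Set[(String, String)]): Seq[String] = {
    def linked(t: String, chosen: Seq[String]): Boolean =
      chosen.exists(c => preds((t, c)) || preds((c, t)))

    val order = scala.collection.mutable.ArrayBuffer(tables.head)
    val remaining = scala.collection.mutable.ArrayBuffer(tables.tail: _*)
    while (remaining.nonEmpty) {
      // Prefer a table connected to the current join tree; falling back to
      // an arbitrary table would be a cartesian step.
      val next = remaining.find(t => linked(t, order.toSeq)).getOrElse(remaining.head)
      order += next
      remaining -= next
    }
    order.toSeq
  }

  def main(args: Array[String]): Unit = {
    // select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2
    //                            and c.key3 = d.key3
    val preds = Set(("a", "c"), ("b", "c"), ("c", "d"))
    println(reorder(Seq("a", "b", "c", "d"), preds))
    // a is joined with c before b, avoiding the a-b cartesian step
  }
}
```

For the query in the issue description the sketch produces the order a, c, b, d, which is exactly the manual rewrite that eliminates the cartesian join between a and b.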
[jira] [Created] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
Zhan Zhang created HBASE-14801: -- Summary: Enhance the Spark-HBase connector catalog with json format Key: HBASE-14801 URL: https://issues.apache.org/jira/browse/HBASE-14801 Project: HBase Issue Type: Improvement Reporter: Zhan Zhang Assignee: Zhan Zhang
[jira] [Updated] (HBASE-14795) Enhance the spark-hbase scan operations
[ https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14795: --- Summary: Enhance the spark-hbase scan operations (was: Provide an alternative spark-hbase SQL implementations for Scan) > Enhance the spark-hbase scan operations > --- > > Key: HBASE-14795 > URL: https://issues.apache.org/jira/browse/HBASE-14795 > Project: HBase > Issue Type: Improvement > Reporter: Ted Malaska > Assignee: Zhan Zhang > Priority: Minor > > This is a sub-jira of HBASE-14789. This jira is to focus on the replacement > of TableInputFormat with a more custom scan implementation that will make the > following use case more effective. > Use case: > When you have multiple scan ranges on a single table within a single > query, TableInputFormat will scan the outer range of the scan start and > end range, where this implementation can be more pointed.
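The use case above can be quantified with a small sketch: scanning the outer envelope of several row-key ranges touches every key in between, while one scan per range touches only the requested keys. Row keys are modeled as plain integers here purely to illustrate the trade-off; this is not the connector's implementation.

```scala
// Outer-envelope scan (TableInputFormat-style) vs. one scan per range.
object MultiRangeScanSketch {
  type KeyRange = (Int, Int) // inclusive start/end row keys

  // One scan from the smallest start to the largest end: the gaps
  // between requested ranges are read too.
  def outerScan(ranges: Seq[KeyRange]): Seq[Int] = {
    val lo = ranges.map(_._1).min
    val hi = ranges.map(_._2).max
    lo to hi
  }

  // Pointed scans: one scan per requested range, no wasted reads.
  def pointedScans(ranges: Seq[KeyRange]): Seq[Int] =
    ranges.flatMap { case (lo, hi) => lo to hi }

  def main(args: Array[String]): Unit = {
    val ranges = Seq((2, 3), (7, 8))
    println(s"outer scan reads ${outerScan(ranges).size} keys; " +
            s"pointed scans read ${pointedScans(ranges).size}")
    // prints: outer scan reads 7 keys; pointed scans read 4
  }
}
```

The wider the gap between the requested ranges, the more the outer-envelope scan over-reads, which is why a more pointed per-range scan helps this use case.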
[jira] [Updated] (HBASE-14796) Enhance the Gets in the connector
[ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14796: --- Summary: Enhance the Gets in the connector (was: Provide an alternative spark-hbase SQL implementations for Gets) > Enhance the Gets in the connector > - > > Key: HBASE-14796 > URL: https://issues.apache.org/jira/browse/HBASE-14796 > Project: HBase > Issue Type: Improvement > Reporter: Ted Malaska > Assignee: Zhan Zhang > Priority: Minor > > Currently the Spark module's Spark SQL implementation gets records from HBase > on the driver if there is something like the following found in the SQL: > rowkey = 123 > The original reason for this was that normal SQL will not have many equality > operations in a single where clause. > Zhan had brought up two points that have value: > 1. The SQL may be generated and may have many, many equal statements in it, so > moving the work to an executor protects the driver from load. > 2. In the current implementation the driver is connecting to HBase, and > exceptions may cause trouble with the Spark application and not just with a > single task execution.
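Point 1 above, moving many equality Gets off the driver, can be sketched as batching the row keys and letting each batch be fetched by its own worker. `fetch` below is a hypothetical stand-in for an HBase Get, and the sequential `flatMap` stands in for per-partition execution on executors; none of this is the connector's actual API.

```scala
// Sketch: distribute many point-gets into per-"executor" batches instead
// of issuing them all from the driver.
object DistributedGetSketch {
  def distributeGets[K, V](keys: Seq[K], partitions: Int)(fetch: K => V): Seq[V] = {
    // Ceiling division so every key lands in one of `partitions` batches.
    val batchSize = math.max(1, (keys.size + partitions - 1) / partitions)
    keys.grouped(batchSize)
        .toSeq
        .flatMap(batch => batch.map(fetch)) // each batch = one executor's work
  }

  def main(args: Array[String]): Unit = {
    // fetch is faked as multiplication; in the real connector it would be
    // an HBase Get executed inside the task holding the batch.
    println(distributeGets(Seq(1, 2, 3, 4), 2)(_ * 10))
  }
}
```

Besides spreading the load, running `fetch` inside a task means an HBase exception fails only that task (which Spark can retry) rather than the driver.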
[jira] [Updated] (HBASE-14789) Enhance the current spark-hbase connector
[ https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14789: --- Description: This JIRA is to optimize the RDD construction in the current connector implementation. (was: This JIRA is to provide user an option to choose different Spark-HBase implementation based on requirements.) > Enhance the current spark-hbase connector > - > > Key: HBASE-14789 > URL: https://issues.apache.org/jira/browse/HBASE-14789 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: shc.pdf > > > This JIRA is to optimize the RDD construction in the current connector > implementation.
[jira] [Updated] (HBASE-14789) Enhance the current spark-hbase connector
[ https://issues.apache.org/jira/browse/HBASE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14789: --- Summary: Enhance the current spark-hbase connector (was: Provide an alternative spark-hbase connector) > Enhance the current spark-hbase connector > - > > Key: HBASE-14789 > URL: https://issues.apache.org/jira/browse/HBASE-14789 > Project: HBase > Issue Type: Improvement > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: shc.pdf > > > This JIRA is to provide users an option to choose a different Spark-HBase > implementation based on requirements.