Collecting large dataset
Hi. I have been trying to collect a large dataset (about 2 GB in size, 30 columns, more than a million rows) onto the driver side. I am aware that collecting such a huge dataset isn't recommended; however, the application within which the Spark driver is running requires that data.

While collecting the dataframe, the Spark job throws an error: TaskResultLost (result lost from block manager). I searched for solutions around this and set the following properties: spark.blockManager.port, spark.driver.blockManager.port, and spark.driver.maxResultSize set to 0 (unlimited). The application within which the Spark driver is running has 28 GB of max heap size. And yet the error arises again. There are 22 executors running in my cluster.

Is there any config/necessary step that I am missing before collecting such large data? Or is there any other, more effective approach that would let me collect such large data without failure?

Thanks,
Rishikesh
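One pattern that often sidesteps TaskResultLost on large collects is to avoid a single all-at-once collect() and instead stream the result to the driver one partition at a time. A minimal sketch for spark-shell, assuming an existing DataFrame `df` (names and paths here are illustrative, not from the original post):

```scala
// Stream partitions to the driver one at a time instead of df.collect().
// Only one partition's task result is in transit through the block manager
// at any moment, so the driver never receives the whole 2 GB in one shot.
val rows = df.toLocalIterator() // fetches partitions lazily, one per job
while (rows.hasNext) {
  val row = rows.next()
  // hand `row` to the driver-side application here
}

// Alternative: write to shared storage and let the driver app read the files.
df.write.mode("overwrite").parquet("hdfs:///tmp/large_result")
```

Note that toLocalIterator keeps driver memory bounded by a single partition, so repartitioning into smaller partitions first (e.g. df.repartition(200)) can reduce pressure further; the trade-off is one Spark job per partition.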
How to combine all rows into a single row in DataFrame
Hi All,
I have been trying to serialize a dataframe in protobuf format. So far, I have been able to serialize every row of the dataframe by using a map function, with the serialization logic inside the lambda. The resultant dataframe consists of rows in serialized format (1 row = 1 serialized message).

I wish to form a single protobuf-serialized message for the whole dataframe, and in order to do that I need to combine all the serialized rows using some custom logic, very similar to the one used in the map operation. I am assuming that this would be possible by using a reduce operation on the dataframe; however, I am unaware of how to go about it. Any suggestions/approaches would be much appreciated.

Thanks,
Rishikesh
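For what it's worth, the map-then-reduce the question describes can be sketched like this in spark-shell. `serializeRow` and `combineMessages` stand in for the poster's protobuf logic and are assumptions, not code from the thread:

```scala
// Hypothetical helpers standing in for the protobuf logic:
//   serializeRow(row): Array[Byte]      -- what the existing map() lambda does
//   combineMessages(a, b): Array[Byte]  -- merges two serialized messages
// reduce() requires the combine function to be associative: Spark applies it
// within each partition first, then across the partial results, so the merge
// must give the same message regardless of grouping.
val combined: Array[Byte] = df.rdd
  .map(row => serializeRow(row))
  .reduce((a, b) => combineMessages(a, b))
```

If the merge is not associative (e.g. it depends on row order), collecting the serialized rows and folding them on the driver in a defined order may be safer than reduce.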
Re: Hive external table not working in sparkSQL when subdirectories are present
Hi,
I did not explicitly create a HiveContext. I have been using the spark.sqlContext that gets created upon launching the spark-shell. Isn't this sqlContext the same as the HiveContext?
Thanks,
Rishikesh

On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke wrote:
> Do you use the HiveContext in Spark? Do you configure the same options
> there? Can you share some code?
Re: Hive external table not working in sparkSQL when subdirectories are present
Hi.
I am using Spark 2.3.2 and Hive 3.1.0. Even if I use parquet files the result would be the same, because after all sparkSQL isn't able to descend into the subdirectories over which the table is created. Could there be any other way?
Thanks,
Rishikesh

On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh wrote:
> which versions of Spark and Hive are you using?
>
> what will happen if you use parquet tables instead?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
Hive external table not working in sparkSQL when subdirectories are present
Hi.
I have built a Hive external table on top of a directory 'A' which has data stored in ORC format. This directory has several subdirectories inside it, each of which contains the actual ORC files. These subdirectories are created by Spark jobs which ingest data from other sources and write it into this directory.

I created the table and set its table properties hive.mapred.supports.subdirectories=TRUE and mapred.input.dir.recursive=TRUE. As a result of this, when I run the simplest query, select count(*) from ExtTable, via the Hive CLI, it successfully gives me the expected count of records in the table. However, when I run the same query via sparkSQL, I get count = 0.

I think sparkSQL isn't able to descend into the subdirectories to get the data, while Hive is able to do so. Are there any configurations that need to be set on the Spark side so that this works as it does via the Hive CLI? I am using Spark on YARN.

Thanks,
Rishikesh

Tags: subdirectories, subdirectory, recursive, recursion, hive external table, orc, sparksql, yarn
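A commonly suggested workaround (untested here; double-check the property names against your Hadoop/Hive versions) is to set the recursive-read properties on the Spark side as well, since table properties set in Hive do not automatically reach Spark's file scan. Prefixing a Hadoop property with spark.hadoop. passes it into the Hadoop configuration Spark uses:

```
spark-shell --master yarn \
  --conf spark.hadoop.hive.mapred.supports.subdirectories=true \
  --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true
```

The same pair can also be set inside a session via spark.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true") before querying the table.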
Re: Connecting to Spark cluster remotely
To put it simply, what are the configurations that need to be done on the client machine so that it can run the driver on itself and the executors on the Spark-on-YARN cluster nodes?

On Mon, Apr 22, 2019, 8:22 PM Rishikesh Gawade wrote:
> Hi.
> I have been experiencing trouble while trying to connect to a Spark
> cluster remotely. This Spark cluster is configured to run using YARN.
Connecting to Spark cluster remotely
Hi.
I have been experiencing trouble while trying to connect to a Spark cluster remotely. This Spark cluster is configured to run using YARN. Can anyone guide me or provide any step-by-step instructions for connecting remotely via spark-shell?

Here's the setup that I am using: The Spark cluster is running with each node as a Docker container hosted on a VM. It is using YARN for scheduling resources for computations. I have a dedicated Docker container acting as a Spark client, on which I have spark-shell installed (the Spark binary in a standalone setup), and also the Hadoop and YARN config directories set so that spark-shell can coordinate with the RM for resources.

With all of this set, I tried using the following command:

spark-shell --master yarn --deploy-mode client

This results in spark-shell giving me a Scala-based console; however, when I check the Resource Manager UI on the cluster, there seems to be no application/Spark session running. I have been expecting the driver to run on the client machine and the executors to run in the cluster, but that doesn't seem to happen.

How can I achieve this? Is whatever I am trying feasible, and if so, is it good practice?

Thanks & Regards,
Rishikesh
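For YARN client mode from a remote machine, the usual minimum setup (a hedged sketch; hostnames, paths, and port numbers below are illustrative) is to point the client at the cluster's Hadoop/YARN configs and make sure executors can connect back to the driver:

```
# Hadoop/YARN client configs copied from the cluster must be on the client:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# Executors open connections back to the driver, so advertise an address
# and ports that are routable from the cluster nodes (important when the
# client itself runs inside a Docker container):
spark-shell --master yarn --deploy-mode client \
  --conf spark.driver.host=<client-address-reachable-from-cluster> \
  --conf spark.driver.port=40000 \
  --conf spark.blockManager.port=40001
```

If no application appears in the RM UI at all, it is worth confirming that HADOOP_CONF_DIR actually points at the remote cluster's configs and not at a local/default Hadoop install, since spark-shell will otherwise silently fall back to a local master.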
How to use same SparkSession in another app?
Hi. I wish to use a SparkSession created by one app in another app, so that I can use the dataframes belonging to that session. Is it possible to use the same SparkSession in another app?
Thanks,
Rishikesh
Error: NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT while running a Spark-Hive Job
Hello there,
I am using spark-2.3.0 compiled using the following Maven command: mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean install. I have configured it to run with Hive v2.3.3. Also, all the Hive-related JARs (v1.2.1) in Spark's jars folder have been replaced by the JARs available in Hive's lib folder, and I have also configured Hive to use Spark as its execution engine.

After that, in my Java application I have written the following lines:

SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark Hive Example")
    .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://hadoopmaster:9083")
    .enableHiveSupport()
    .getOrCreate();
HiveContext hc = new HiveContext(spark);
hc.sql("SHOW DATABASES").show();

I built the project thereafter (no compilation errors) and tried running the job using the spark-submit command as follows:

spark-submit --master yarn --class org.adbms.SpamFilter IdeaProjects/mlproject/target/mlproject-1.0-SNAPSHOT.jar

On doing so, I received the following error:

Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
    at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
    and more...

I have no idea what this is about. Is this because Spark v2.3.0 and Hive v2.3.3 aren't compatible with each other? If I have done anything wrong, I request you to point it out and suggest the required changes. Also, if it's the case that I might have misconfigured Spark and Hive, please suggest the changes in configuration; a link guiding through all the necessary configs would also be appreciated.
Thank you in anticipation.
Regards,
Rishikesh Gawade
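For reference, HIVE_STATS_JDBC_TIMEOUT is a constant that exists in Hive 1.2.x but was removed in Hive 2.x, and Spark 2.3's hive module still references it at startup, which is why swapping Hive 2.3.3 JARs into Spark's jars folder triggers the NoSuchFieldError. The usual approach (sketched below; the version value is illustrative, so verify the metastore versions your exact Spark release supports before using it) is to restore Spark's bundled Hive 1.2.1 JARs and point Spark at the external metastore through configuration instead of replacing JARs:

```
# spark-defaults.conf (illustrative values -- check your Spark release's
# "Interacting with Different Versions of Hive Metastore" docs)
spark.sql.hive.metastore.version   2.1.1
spark.sql.hive.metastore.jars      maven
# hive.metastore.uris stays in hive-site.xml, e.g. thrift://hadoopmaster:9083
```

With this setup Spark uses its built-in Hive classes for execution and only loads a separate, version-matched client for talking to the remote metastore.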
ERROR: Hive on Spark
Hello there. I am a newbie in the world of Spark. I have been working on a Spark project using Java. I have configured Hive and Spark to run on Hadoop. As of now I have created a Hive (Derby) database on Hadoop HDFS at the given warehouse location /user/hive/warehouse, with the database name spam (saved as spam.db at the aforementioned location). I have been trying to read the tables in this database in Spark to create RDDs/DataFrames. Could anybody please guide me in how I can achieve this?

I used the following statements in my Java code:

SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark Hive Example").master("yarn")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate();
spark.sql("USE spam");
spark.sql("SELECT * FROM spamdataset").show();

After this I built the project using Maven (mvn clean package -DskipTests) and a JAR was generated. I then tried running the project via the spark-submit CLI using:

spark-submit --class com.adbms.SpamFilter --master yarn ~/IdeaProjects/mlproject/target/mlproject-1.0-SNAPSHOT.jar

and got the following error:

Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spam' not found;
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:174)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:256)
    at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
    at com.adbms.SpamFilter.main(SpamFilter.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Also, I replaced the SQL query with "SHOW DATABASES", and it showed only one database, namely "default". Those stored in the HDFS warehouse dir weren't shown. I request you to please check this, and if anything is wrong, please suggest an ideal way to read Hive tables on Hadoop in Spark using Java. A link to a webpage with relevant info would also be appreciated.
Thank you in anticipation.
Regards,
Rishikesh Gawade
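A database list that shows only "default" usually means the application is talking to a fresh, local Derby metastore rather than the one that holds spam.db. A common fix (a sketch, assuming a standard layout; paths below are illustrative) is to make the cluster's hive-site.xml visible to Spark so that the metastore location and warehouse directory come from there:

```
# Make Hive's client config visible to Spark (paths are illustrative):
cp /path/to/hive/conf/hive-site.xml $SPARK_HOME/conf/

# Or ship it explicitly with the job:
spark-submit --class com.adbms.SpamFilter --master yarn \
  --files /path/to/hive/conf/hive-site.xml \
  ~/IdeaProjects/mlproject/target/mlproject-1.0-SNAPSHOT.jar
```

Without hive-site.xml on its classpath, enableHiveSupport() makes Spark create its own embedded Derby metastore in the working directory, which explains why only "default" is visible.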
Accessing Hive Database (On Hadoop) using Spark
Hello there. I am a newbie in the world of Spark. I have been working on a Spark project using Java. I have configured Hive and Spark to run on Hadoop. As of now I have created a Hive (Derby) database on Hadoop HDFS at the given warehouse location /user/hive/warehouse, with the database name spam (saved as spam.db at the aforementioned location). I have been trying to read the tables in this database in Spark to create RDDs/DataFrames. Could anybody please guide me in how I can achieve this?

I used the following statements in my Java code:

SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark Hive Example").master("yarn")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate();
spark.sql("USE spam");
spark.sql("SELECT * FROM spamdataset").show();

After this I built the project using Maven (mvn clean package -DskipTests) and a JAR was generated. I then tried running the project via the spark-submit CLI using:

spark-submit --class com.adbms.SpamFilter --master yarn ~/IdeaProjects/mlproject/target/mlproject-1.0-SNAPSHOT.jar

and got the following error:

Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spam' not found;
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:174)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:256)
    at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
    at com.adbms.SpamFilter.main(SpamFilter.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I request you to please check this, and if anything is wrong, please suggest an ideal way to read Hive tables on Hadoop in Spark using Java. A link to a webpage with relevant info would also be appreciated.
Thank you in anticipation.
Regards,
Rishikesh Gawade