Re: How to Keep Null values in Parquet

2018-11-21 Thread Soumya D. Sanyal
Hi Chetan, Have you tried casting the null values/columns to a supported type — e.g. `StringType`, `IntegerType`, etc.? See also https://issues.apache.org/jira/browse/SPARK-10943. — Soumya > On Nov 21, 2018, at 9:29 PM, Chetan Khatri >
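
A minimal sketch of the suggested cast, assuming a DataFrame with an otherwise untyped null column; the column names and output path are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType

    val spark = SparkSession.builder().appName("null-to-parquet").getOrCreate()
    import spark.implicits._

    // Give the all-null column an explicit type so Parquet can record a schema for it.
    val df = Seq(("a", 1), ("b", 2)).toDF("name", "value")
      .withColumn("comment", lit(null).cast(StringType))

    df.write.mode("overwrite").parquet("/tmp/with_nulls.parquet")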

Re: getBytes : save as pdf

2018-10-10 Thread Joel D
I haven’t tried this but maybe you can try using some PDF library to write the binary contents as PDF. On Wed, Oct 10, 2018 at 11:30 AM ☼ R Nair wrote: > All, > > I am reading a zipped file into an RDD and getting the rdd._1 as the name > and rdd._2.getBytes() as the content. How can I save the
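
A sketch of one way to persist each entry's raw bytes as its own output file, assuming the pairs come from sc.binaryFiles() (so _2 is a PortableDataStream); the HDFS URI, input pattern, and output directory are assumptions, and no PDF library is needed if the bytes are already valid PDF content:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // zipped: RDD[(String, org.apache.spark.input.PortableDataStream)]
    val zipped = sc.binaryFiles("hdfs:///data/archives/*.zip")

    zipped.foreachPartition { iter =>
      val fs = FileSystem.get(new URI("hdfs:///"), new Configuration())
      iter.foreach { case (name, content) =>
        // Write the raw bytes of each entry to its own file, keyed by the source file name.
        val out = fs.create(new Path("/tmp/out-pdfs/" + new Path(name).getName))
        try out.write(content.toArray()) finally out.close()
      }
    }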

Process Million Binary Files

2018-10-10 Thread Joel D
Hi, I need to process millions of PDFs in HDFS using Spark. First I’m trying with some 40k files. I’m using the binaryFiles API, with which I’m facing a couple of issues: 1. It creates only 4 tasks and I can’t seem to increase the parallelism there. 2. It took 2276 seconds and that means for millions
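
A sketch of the usual mitigation, under the assumption that the files sit under one HDFS directory (path and counts are illustrative): pass a minPartitions hint and, since binaryFiles packs many small files into each partition, repartition explicitly afterwards.

    // binaryFiles(path, minPartitions) treats minPartitions only as a hint,
    // so an explicit repartition is often needed to actually raise parallelism.
    val pdfs = sc.binaryFiles("hdfs:///data/pdfs/", minPartitions = 400)
    println(s"initial partitions: ${pdfs.getNumPartitions}")

    val spread = pdfs.repartition(400)
    println(s"after repartition: ${spread.getNumPartitions}")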

Re: Text from pdf spark

2018-09-28 Thread Joel D
ent from my iPhone > > On Sep 28, 2018, at 12:10 PM, Joel D wrote: > > I'm trying to extract text from pdf files in hdfs using pdfBox. > > However it throws an error: > > "Exception in thread "main" org.apache.spark.SparkException: ... > > java.io.FileNo

Text from pdf spark

2018-09-28 Thread Joel D
I'm trying to extract text from pdf files in hdfs using pdfBox. However it throws an error: "Exception in thread "main" org.apache.spark.SparkException: ... java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf (No such file or directory)" What am I missing? Should I be working with
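
The FileNotFoundException typically means an HDFS path was handed to code that expects a local file. A sketch under that assumption, reading each PDF's bytes through binaryFiles and feeding them to PDFBox (assumes PDFBox 2.x on the executor classpath; the path is illustrative):

    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.text.PDFTextStripper

    val texts = sc.binaryFiles("hdfs:///tmp/pdfs/*.pdf").map { case (path, stream) =>
      // Load the document from its bytes rather than from a java.io.File path.
      val doc = PDDocument.load(stream.toArray())
      try (path, new PDFTextStripper().getText(doc)) finally doc.close()
    }
    texts.take(1).foreach { case (p, t) => println(s"$p: ${t.take(200)}") }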

Re: [Spark SQL] why does spark sql hash() return the same hash value though the keys/expr are not the same

2018-09-28 Thread Gokula Krishnan D
> > |hash(40514X)|hash(41751)| > > | -1898845883| 916273350| > > scala> spark.sql("select hash('14589'),hash('40004')").show() > > |hash(1

[Spark SQL] why does spark sql hash() return the same hash value though the keys/expr are not the same

2018-09-25 Thread Gokula Krishnan D
Hello All, I am calculating the hash value of a few columns and determining whether it's an Insert/Delete/Update record, but found a scenario which is a little weird: some of the records return the same hash value though the keys are totally different. For instance, scala> spark.sql("select
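
hash() is a 32-bit Murmur3 hash, so occasional collisions between different keys are expected. A sketch of a lower-collision alternative for change detection, using sha2 over the concatenated key columns; the DataFrame and column names are assumptions:

    import org.apache.spark.sql.functions.{col, concat_ws, sha2}

    // A 256-bit digest makes accidental matches between different keys vanishingly unlikely.
    val withRowHash = sourceDf.withColumn(
      "row_hash",
      sha2(concat_ws("||", col("key_col"), col("attr1"), col("attr2")), 256)
    )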

Re: Multiple joins with same Dataframe throws AnalysisException: resolved attribute(s)

2018-07-19 Thread Joel D
One workaround is to rename the fid column for each df before joining. On Thu, Jul 19, 2018 at 9:50 PM wrote: > Spark 2.3.0 has this problem upgrade it to 2.3.1 > > Sent from my iPhone > > On Jul 19, 2018, at 2:13 PM, Nirav Patel wrote: > > corrected subject line. It's missing attribute error
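
A minimal sketch of that workaround for the resolved-attribute error (df1/df2 and the fid column are as described in the thread; everything else is illustrative):

    // Renaming the shared column on one side removes the ambiguous attribute reference.
    val df2Renamed = df2.withColumnRenamed("fid", "fid_r")
    val joined = df1.join(df2Renamed, df1("fid") === df2Renamed("fid_r"), "inner")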

Re: How to avoid duplicate column names after join with multiple conditions

2018-07-06 Thread Gokula Krishnan D
Nirav, the withColumnRenamed() API might help, but it does not distinguish between the duplicate columns and renames all occurrences of the given column name. Alternatively, use the select() API and rename as you want. Thanks & Regards, Gokula Krishnan* (Gokul)* On Mon, Jul 2, 2018 at 5:52 PM, Nirav Patel wrote: > Expr is `df1(a)
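
A sketch of the select()-based renaming, which targets specific columns instead of every occurrence of a name; the column names are illustrative:

    import org.apache.spark.sql.functions.col

    val left  = df1.select(col("id"), col("amount").as("amount_left"))
    val right = df2.select(col("id"), col("amount").as("amount_right"))
    // Joining on the shared key leaves only uniquely named amount columns.
    val joined = left.join(right, Seq("id"))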

Re: [EXTERNAL] - Re: testing frameworks

2018-05-22 Thread Joel D
We’ve developed our own testing framework covering different areas of checking, sometimes providing expected data and comparing it with the resultant data from the data object. Cheers. On Tue, May 22, 2018 at 1:48 PM Steve Pruitt wrote: > Something more on

No Tasks have reported metrics yet

2018-01-10 Thread Joel D
Hi, I have a job which takes a HiveQL query joining 2 tables (2.5 TB, 45GB), repartitions to 100 and then does some other transformations. This executed fine earlier. Job stages: Stage 0: Hive table 1 scan Stage 1: Hive table 2 scan Stage 2: Tungsten exchange for the join Stage 3: Tungsten exchange for

Re: Process large JSON file without causing OOM

2017-11-13 Thread Joel D
Have you tried increasing driver and executor memory (GC overhead too, if required)? Your code snippet and stack trace would be helpful. On Mon, Nov 13, 2017 at 7:23 PM Alec Swan wrote: > Hello, > > I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB > format.

Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-29 Thread Gokula Krishnan D
Do you see any changes or improvements in the *Core API* in Spark 2.X when compared with Spark 1.6.0? Thanks & Regards, Gokula Krishnan* (Gokul)* On Mon, Sep 25, 2017 at 1:32 PM, Gokula Krishnan D <email2...@gmail.com> wrote: > Thanks for the reply. Forgot to mention that,

Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-25 Thread Gokula Krishnan D
2. Try to compare the CPU time instead of the wall-clock time 3. Check the stages that got slower and compare the DAGs 4. Test with dynamic allocation disabled On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D <email2...@gmail.com> wrote: > Hello All, > > Currently our Batch ETL Jo

Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-22 Thread Gokula Krishnan D
actors that influence that. > > 2. Try to compare the CPU time instead of the wall-clock time > > 3. Check the stages that got slower and compare the DAGs > > 4. Test with dynamic allocation disabled > > On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D <email2...@gmail.com

What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-22 Thread Gokula Krishnan D
Hello All, Currently our batch ETL jobs are on Spark 1.6.0 and we are planning to upgrade to Spark 2.1.0. With minor code changes (like configuration and SparkSession/sc) we were able to execute the existing job on Spark 2.1.0. But we noticed that job completion timings are much better in Spark 1.6.0 but no

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
nt-core/src/main/java/org/apache/hadoop/mapred/InputFormat.java#L85>, > and the partition desired is at most a hint, so the final result could be a > bit different. > > On 25 July 2017 at 19:54, Gokula Krishnan D <email2...@gmail.com> wrote: > >> Excuse for the too

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Excuse the many mails on this post. I found a similar issue: https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D <email2...@gmail.com> wrote: > In addition to th

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
In addition to that, I tried to read the same file with 3000 partitions, but it used 3070 partitions and took more time than before; please refer to the attachment. Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com> wrote

[Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Hello All, I have an HDFS file with approx. *1.5 billion records* in 500 part files (258.2GB in size), and when I tried to execute the following I could see that it used 2290 tasks, but shouldn't it be 500, the same as the HDFS part files? val inputFile = val inputRdd = sc.textFile(inputFile)
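
A sketch of what is going on, under the usual explanation (echoed later in this thread) that minPartitions is only a lower-bound hint and the actual count comes from the InputFormat's splits, so 500 part files can still produce ~2290 tasks when each file spans several splits; the path is illustrative:

    val inputRdd = sc.textFile("hdfs:///data/input", minPartitions = 500)
    println(inputRdd.getNumPartitions)   // driven by the HDFS splits, not by the hint alone

    // coalesce() can shrink the partition count without a shuffle if fewer, larger tasks are preferred.
    val coalesced = inputRdd.coalesce(500)
    println(coalesced.getNumPartitions)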

Re: Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-21 Thread Gokula Krishnan D
> >> https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications >> https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application >> >> On Thu, Jul 20, 2017 at 2:02 PM, Gokula Krishnan D <email

Re: Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-20 Thread Gokula Krishnan D
scheduled using the prior > Task's resources -- the fair scheduler is not preemptive of running Tasks. > > On Thu, Jul 20, 2017 at 1:45 PM, Gokula Krishnan D <email2...@gmail.com> > wrote: > >> Hello All, >> >> We are having cluster with 50 Executors each with

Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-20 Thread Gokula Krishnan D
Hello All, We have a cluster with 50 executors, each with 4 cores, so we can avail a maximum of 200 cores. I am submitting a Spark application (JOB A) with scheduler.mode set to FAIR and dynamicallocation=true, and it got all the available executors. In the meantime, submitting another Spark Application
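
A sketch of one way to keep a single FAIR-scheduled, dynamically allocated job from grabbing every executor: cap its allocation so a second application can still acquire resources. The builder-style configuration and the cap value are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("job-a")
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "25")  // leave headroom for JOB B
      .getOrCreate()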

Spark sc.textFile(): files with more partitions vs. files with fewer partitions

2017-07-20 Thread Gokula Krishnan D
Hello All, our Spark applications are designed to process HDFS files (Hive external tables). Recently we modified the Hive file size by setting the following parameters to ensure that files have an average size of 512MB: set hive.merge.mapfiles=true set hive.merge.mapredfiles=true

[Spark STREAMING]: Can not kill job gracefully on spark standalone cluster

2017-06-08 Thread Mariusz D.
There is a problem with killing jobs gracefully in Spark 2.1.0 with spark.streaming.stopGracefullyOnShutdown enabled. I tested killing Spark jobs in many ways and reached some conclusions. 1. With the command spark-submit --master spark:// --kill driver-id, results: it killed all workers almost

Schema Evolution Parquet vs Avro

2017-05-29 Thread Joel D
Hi, We are trying to come up with the best storage format for handling schema changes in ingested data. We noticed that both Avro and Parquet allow one to select based on column name instead of the index/position of the data. However, we are inclined towards Parquet for better read performance
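
A small sketch of Parquet's schema merging on read, which is one way to cope with added columns; the paths are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("schema-evolution").getOrCreate()
    import spark.implicits._

    Seq((1, "a")).toDF("id", "name").write.parquet("/tmp/evolving/batch=1")
    Seq((2, "b", 3.5)).toDF("id", "name", "score").write.parquet("/tmp/evolving/batch=2")

    // mergeSchema reconciles the two compatible schemas; old files read the new column as null.
    val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/evolving")
    merged.printSchema()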

Re: unit testing in spark

2017-04-10 Thread Gokula Krishnan D
Hello Shiv, Unit testing really helps when you follow a TDD approach. It's a safe way to code a program locally, and you can also make use of those test cases during the build process with any of the continuous integration tools (Bamboo, Jenkins). That way you can ensure that artifacts are

Re: Executors - running out of memory

2017-01-19 Thread Venkata D
blondowski, How big is your JSON file? Is it possible to post the Spark params or configuration here? That might give some idea about the issue. Thanks On Thu, Jan 19, 2017 at 4:21 PM, blondowski wrote: > Please bear with me..I'm fairly new to spark. Running

Re: Task Deserialization Error

2016-09-21 Thread Gokula Krishnan D
Hello Sumit - I could see that the SparkConf() specification is not mentioned in your program, but the rest looks good. Output: By the way, I have used the README.md template https://gist.github.com/jxson/1784669 Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Sep 20, 2016 at 2:15 AM,
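
For reference, a minimal sketch of the SparkConf setup being pointed out; the app name, master, and file path are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("README.md")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(5).foreach(println)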

Things to do to learn Cassandra in an Apache Spark Environment

2016-08-23 Thread Gokula Krishnan D
Hello All - Hope you are doing well. I have a general question. I am working on Hadoop using Apache Spark. At this moment we are not using Cassandra, but I would like to know the scope of learning and using it in the Hadoop environment. It would be great if you could provide the use

Re: Get both feature importance and ROC curve from a random forest classifier

2016-07-06 Thread Mathieu D
well, sounds trivial now ... ! thanks ;-) 2016-07-02 10:04 GMT+02:00 Yanbo Liang : > Hi Mathieu, > > Using the new ml package to train a RandomForestClassificationModel, you > can get feature importance. Then you can convert the prediction result to > RDD and feed it into
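
A sketch combining the two outputs, assuming a DataFrame `data` with "label" and "features" columns already assembled; it uses the DataFrame-based BinaryClassificationEvaluator rather than converting predictions to an RDD for BinaryClassificationMetrics as the reply suggests:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    val model = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(train)

    println(model.featureImportances)          // per-feature importance vector

    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")
      .evaluate(model.transform(test))
    println(s"areaUnderROC = $auc")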

Re: Kafka Streaming and partitioning

2016-01-13 Thread David D
Yep that's exactly what we want. Thanks for all the info Cody. Dave. On 13 Jan 2016 18:29, "Cody Koeninger" wrote: > The idea here is that the custom partitioner shouldn't actually get used > for repartitioning the kafka stream (because that would involve a shuffle, > which

How to view the RDD data based on Partition

2016-01-12 Thread Gokula Krishnan D
Hello All - I'm just trying to understand aggregate() and in the meantime got a question. *Is there any way to view the RDD data based on the partition?* For instance, the following RDD has 2 partitions: val multi2s = List(2,4,6,8,10,12,14,16,18,20) val multi2s_RDD =
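
Two ways to inspect the contents per partition, reusing the RDD from the question; a minimal sketch:

    val multi2s = List(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
    val multi2s_RDD = sc.parallelize(multi2s, 2)

    // Tag every element with the index of the partition it lives in.
    multi2s_RDD.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(x => s"partition $idx -> $x")
    }.collect().foreach(println)

    // Or gather each partition as its own array.
    multi2s_RDD.glom().collect().zipWithIndex.foreach { case (part, idx) =>
      println(s"partition $idx: ${part.mkString(", ")}")
    }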

Re: How to view the RDD data based on Partition

2016-01-12 Thread Gokula Krishnan D
nt]) : Iterator[String] = { > iter.toList.map(x => index + "," + x).iterator > } > x.mapPartitionsWithIndex(myfunc).collect() > res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9) > > On Tue, Jan 12, 2016 at 2:06 PM, Gokula Krishnan D <email2...@gm

Re: Problem with WINDOW functions?

2015-12-30 Thread Gokula Krishnan D
Hello Vadim - Alternatively, you can achieve this by using *window functions*, which are available from 1.4.0. *code_value.txt (Input)* = 1000,200,Descr-200,01 1000,200,Descr-200-new,02 1000,201,Descr-201,01 1000,202,Descr-202-new,03 1000,202,Descr-202,01
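
A sketch of the window-function approach on the sample rows above, keeping only the latest description per code; the column names are assumptions:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number
    import sqlContext.implicits._   // spark-shell 1.x

    val codes = Seq(
      (1000, 200, "Descr-200", 1), (1000, 200, "Descr-200-new", 2),
      (1000, 201, "Descr-201", 1),
      (1000, 202, "Descr-202-new", 3), (1000, 202, "Descr-202", 1)
    ).toDF("group_id", "code", "descr", "version")

    val latestFirst = Window.partitionBy("code").orderBy(codes("version").desc)
    codes.withColumn("rn", row_number().over(latestFirst))
      .filter("rn = 1")
      .drop("rn")
      .show()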

Re: difference between ++ and Union of a RDD

2015-12-29 Thread Gokula Krishnan D
Ted - Thanks for the update. Then it's the same case with sc.parallelize() and sc.makeRDD(), right? Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Dec 29, 2015 at 1:43 PM, Ted Yu wrote: > From RDD.scala : > > def ++(other: RDD[T]): RDD[T] = withScope { >

Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-23 Thread Gokula Krishnan D
You can try this, but I slightly modified the input structure since the first two columns were not in JSON format. [image: Inline image 1] Thanks & Regards, Gokula Krishnan* (Gokul)* On Wed, Dec 23, 2015 at 9:46 AM, Eran Witkon wrote: > Did you get a solution for this? > >

Database does not exist: (Spark-SQL ===> Hive)

2015-12-14 Thread Gokula Krishnan D
Hello All - I tried to execute a Spark-Scala program to create a table in Hive and faced a couple of errors, so I just tried to execute "show tables" and "show databases". I have already created a database named "test_db", but I encountered the error "Database does not exist"
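
A common cause is using a plain SQLContext (or missing hive-site.xml on the classpath), so Spark only sees its own local catalog. A sketch of checking through a HiveContext instead, assuming Spark 1.x as in the thread:

    import org.apache.spark.sql.hive.HiveContext

    // Requires hive-site.xml in Spark's conf directory so the real metastore is used.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("show databases").show()
    hiveContext.sql("use test_db")
    hiveContext.sql("show tables").show()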

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
Hello Prashant - Can you please try it like this? For instance, the input file name is "student_detail.txt" with ID,Name,Sex,Age === 101,Alfred,Male,30 102,Benjamin,Male,31 103,Charlie,Female,30 104,Julie,Female,30 105,Maven,Male,30 106,Dexter,Male,30 107,Lundy,Male,32
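
A sketch of the filtering itself, assuming the student_detail.txt rows above are loaded into a DataFrame named students and that the column being checked may hold nulls or empty strings:

    // Keep only rows whose Name column is neither null nor an empty/blank string.
    val nonEmpty = students.filter("Name is not null and trim(Name) != ''")
    nonEmpty.show()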

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
> > On Wed, Dec 9, 2015 at 9:33 PM, Gokula Krishnan D <email2...@gmail.com> > wrote: > >> Hello Prashant - >> >> Can you please try like this : >> >> For the instance, input file name is "student_detail.txt" and >> >> ID,Name,S

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
2 AM, Gokula Krishnan D <email2...@gmail.com> wrote: > Ok, then you can slightly change like > > [image: Inline image 1] > > Thanks & Regards, > Gokula Krishnan* (Gokul)* > > > On Wed, Dec 9, 2015 at 11:09 AM, Prashant Bhardwaj < > prashant2006s...@gmail

How to get the list of available Transformations and actions for an RDD in Spark-Shell

2015-12-04 Thread Gokula Krishnan D
Hello All - In spark-shell, when we press Tab after the dot we can see a list of the possible transformations and actions, but we are unable to see the full list. Is there any other way to get the rest of the list? I'm mainly looking for sortByKey(). val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")
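
sortByKey() only becomes available once the RDD is a key/value pair RDD (it is added through an implicit conversion for RDDs of pairs with an ordered key), which is why tab completion on a plain RDD[String] never lists it. A sketch using the phone_sales file, assuming comma-separated (brand, quantity) lines:

    val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")

    // Turning each line into a (key, value) pair exposes pair-RDD operations such as sortByKey().
    val sales_pairs = sales_RDD.map { line =>
      val Array(brand, qty) = line.split(",")
      (brand.trim, qty.trim.toInt)
    }
    sales_pairs.sortByKey().collect().foreach(println)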

Re: How to get the list of available Transformations and actions for an RDD in Spark-Shell

2015-12-04 Thread Gokula Krishnan D
value pair to > work. I think in scala their are transformation such as .toPairRDD(). > > On Sat, Dec 5, 2015 at 12:01 AM, Gokula Krishnan D <email2...@gmail.com> > wrote: > >> Hello All - >> >> In spark-shell when we press tab after . ; we could see the >

Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread D
mentioned here <https://github.com/rabbitmq/rabbitmq-tutorials/blob/master/java/Recv.java>. Environment: * RabbitMQ version - 3.5.6 * Spark 1.5.2 * Java 8 (Update 66) Can someone let me know what is going wrong and how I can read messages from RabbitMQ via Spark Streaming? Thanks, D

Reusing Spark Functions

2015-10-14 Thread Starch, Michael D (398M)
All, Is a Function object in Spark reused on a given executor, or is it sent and deserialized with each new task? On my project, we have functions that incur a very large setup cost but then could be called many times. Currently, I am using object deserialization to run this intensive setup,
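
Task closures are deserialized on the executor for each task, so setup done inside the function runs repeatedly unless it is held outside the closure. A common workaround (a sketch, not from the thread) is to keep the expensive state in a lazy val inside a singleton object, so it is initialized once per executor JVM and reused across tasks; the dictionary file and the words RDD are illustrative:

    import scala.io.Source

    object HeavySetup {
      // Built at most once per executor JVM, on first use, then shared by all tasks there.
      lazy val dictionary: Set[String] =
        Source.fromFile("/tmp/dictionary.txt").getLines().toSet
    }

    // words: RDD[String], assumed
    val flagged = words.map(w => (w, HeavySetup.dictionary.contains(w)))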

ExternalSorter: Thread *** spilling in-memory map of 352.6 MB to disk (38 times so far)

2015-08-24 Thread d...@lumity.com
Hello, I'm trying to run a Spark 1.5 job with: ./spark-shell --driver-java-options -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044 -Xms16g -Xmx48g -Xss128m I get lots of error messages like: 15/08/24 20:24:33 INFO ExternalSorter: Thread 172 spilling in-memory map of

Re: spark ec2 as non-root / any plan to improve that in the future ?

2015-07-10 Thread Mathieu D
Quick and clear answer, thank you. 2015-07-09 21:07 GMT+02:00 Nicholas Chammas nicholas.cham...@gmail.com: No plans to change that at the moment, but agreed it is against accepted convention. It would be a lot of work to change the tool, change the AMIs, and test everything. My suggestion is

custom join using complex keys

2015-05-09 Thread Mathieu D
Hi folks, I need to join RDDs having composite keys like this: (K1, K2 ... Kn). The joining rule looks like this: * if left.K1 == right.K1, then we have a true equality, and all K2... Kn are also equal. * if left.K1 != right.K1 but left.K2 == right.K2, I have a partial equality, and I also

Re: amp lab spark streaming twitter example

2014-08-25 Thread Forest D
: Could you be hitting this? https://issues.apache.org/jira/browse/SPARK-3178 On Sun, Aug 24, 2014 at 10:21 AM, Forest D dev24a...@gmail.com wrote: Hi folks, I have been trying to run the AMPLab’s twitter streaming example (http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing

amp lab spark streaming twitter example

2014-08-24 Thread Forest D
Hi folks, I have been trying to run the AMPLab’s twitter streaming example (http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html) in the last 2 days. I have encountered the same error messages as shown below: 14/08/24 17:14:22 ERROR