Re: Do we need schema for Parquet files with Spark?

2016-03-03 Thread Xinh Huynh
Hi Ashok, On the Spark SQL side, when you create a dataframe, it will have a schema (each column has a type such as Int or String). Then when you save that dataframe as parquet format, Spark translates the dataframe schema into Parquet data types. (See spark.sql.execution.datasources.parquet.)
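
For illustration, a minimal sketch of that round trip (path and column names are placeholders):

    val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    df.write.parquet("/tmp/example_parquet")        // the dataframe schema is translated into Parquet types
    val back = sqlContext.read.parquet("/tmp/example_parquet")
    back.printSchema()                              // the schema comes back from the Parquet footer, nothing to re-declare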

Re: Do we need schema for Parquet files with Spark?

2016-03-03 Thread ashokkumar rajendran
Hi Ted, Thanks for pointing this out. This page has a mailing list for developers but not for users, it seems. Including the developers mailing list only. Hi Parquet Team, Could you please clarify the question below? Please let me know if there is a separate mailing list for users but not

Re: AVRO vs Parquet

2016-03-03 Thread Koert Kuipers
well can you use orc without bringing in the kitchen sink of dependencies also known as hive? On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim wrote: > How about ORC? I have experimented briefly with Parquet and ORC, and I > liked the fact that ORC has its schema within the

Spark 1.5.2 - Read custom schema from file

2016-03-03 Thread Divya Gehlot
Hi, I have defined a custom schema as shown below:

val customSchema = StructType(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank",
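
As a side note, StructType expects a sequence of fields. A sketch of a compiling version, assuming the last field is a StringType and the file is read with the spark-csv package:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Array(
      StructField("year", IntegerType, true),
      StructField("make", StringType, true),
      StructField("model", StringType, true),
      StructField("comment", StringType, true),
      StructField("blank", StringType, true)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)
      .load("cars.csv")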

Re: Do we need schema for Parquet files with Spark?

2016-03-03 Thread Ted Yu
Have you taken a look at https://parquet.apache.org/community/ ? On Thu, Mar 3, 2016 at 7:32 PM, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi, > > I am exploring to use Apache Parquet with Spark SQL in our project. I > notice that Apache Parquet uses different encoding for

Re: AVRO vs Parquet

2016-03-03 Thread Jong Wook Kim
How about ORC? I have experimented briefly with Parquet and ORC, and I liked the fact that ORC has its schema within the file, which makes it handy to work with any other tools. Jong Wook On 3 March 2016 at 23:29, Don Drake wrote: > My tests show Parquet has better

Re: AVRO vs Parquet

2016-03-03 Thread Don Drake
My tests show Parquet has better performance than Avro in just about every test. It really shines when you are querying a subset of columns in a wide table. -Don On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann wrote: > Which format is the best format for SparkSQL adhoc

Re: No event log in /tmp/spark-events

2016-03-03 Thread PatrickYu
alvarobrandon wrote > Just write /tmp/sparkserverlog without the file part. I don't get your point. What do you mean by 'without the file part'? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-event-log-in-tmp-spark-events-tp26318p26394.html Sent from the

[Issue:] Getting null values for Numeric types while accessing Hive tables (Registered on HBase, created through Phoenix)

2016-03-03 Thread Divya Gehlot
Hi, I am registering a Hive table on HBase: CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING, AGE INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:AGE") TBLPROPERTIES ("hbase.table.name" = "TEST",

Re: Configuring/Optimizing Spark

2016-03-03 Thread Jesse F Chen
So you have 90GB total memory, and 24 total cores. Let's say you want to use 80% of all that memory (leaving memory for other components) so you have 72GB to use. You want to take advantage of all the cores and memory. So this would be close: executor size = 6g number of executors = 12 cores
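
As a rough sketch, that sizing translates into submit options along these lines (2 cores per executor so that 12 executors cover the 24 cores; exact values also depend on the yarn memory overhead):

    spark-submit --num-executors 12 --executor-memory 6g --executor-cores 2 ... <app jar>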

Re: an OOM while persist as DISK_ONLY

2016-03-03 Thread Eugen Cepoi
We are in the process of upgrading to spark 1.6 from 1.4, and had a hard time getting some of our more memory/join intensive jobs to work (rdd caching + a lot of shuffling). Most of the time they were getting killed by yarn. Increasing the overhead was of course an option but the increase to make

Re: an OOM while persist as DISK_ONLY

2016-03-03 Thread Ted Yu
bq. that solved some problems Is there any problem that was not solved by the tweak ? Thanks On Thu, Mar 3, 2016 at 4:11 PM, Eugen Cepoi wrote: > You can limit the amount of memory spark will use for shuffle even in 1.6. > You can do that by tweaking the

Re: an OOM while persist as DISK_ONLY

2016-03-03 Thread Andy Dang
Spark's shuffling algorithm is very aggressive in storing everything in RAM, and the behavior is worse in 1.6 with the UnifiedMemoryManagement. At least in previous versions you could limit the shuffle memory, but Spark 1.6 will use as much memory as it can get. What I see is that Spark seems to
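
For reference, a sketch of the settings involved (values shown are the Spark 1.6 defaults; verify against the docs before relying on them):

    # unified memory manager (Spark 1.6)
    spark.memory.fraction          0.75   # fraction of heap shared by execution and storage
    spark.memory.storageFraction   0.5    # portion of the above where cached blocks are immune to eviction
    # or fall back to the pre-1.6 behaviour, where shuffle memory can be capped separately
    spark.memory.useLegacyMode     true
    spark.shuffle.memoryFraction   0.2
    spark.storage.memoryFraction   0.6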

Configuring/Optimizing Spark

2016-03-03 Thread Chadha Pooja
Hi, I am trying to understand the best parameter settings for processing a 12.5 GB file with my Spark cluster. I am using a 3 node cluster, with 8 cores and 30 GiB of RAM on each node. I used Cloudera's top 5 mistakes articles and tried the following configurations: spark.executor.instances

getting null values from hive partitioned table after upgrading Spark to 1.5.0

2016-03-03 Thread Jyothi Mandava
Hi All, We recently upgraded Spark to 1.5.0 on CDH and are unable to get proper results for Hive partitioned tables from Spark after that. Most of the fields are null though they have values in the table. The same query works from the Hive CLI. There is no problem with non-partitioned tables query

Re: convert SQL multiple Join in Spark

2016-03-03 Thread Mich Talebzadeh
It is absolutely best to use SQL here, even in the spark shell. Look at this example using SQL:

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
println("\nStarted at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') ").collect.foreach(println)
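
Along the same lines, a multi-table join can be written directly in SQL (tables and columns below are made-up placeholders):

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.sql("""
      SELECT o.order_id, c.name, p.title
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
      JOIN products p ON o.product_id = p.id
    """).show()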

RE: convert SQL multiple Join in Spark

2016-03-03 Thread Mohammed Guller
Why not use Spark SQL? Mohammed Author: Big Data Analytics with Spark From: Vikash Kumar [mailto:vikashsp...@gmail.com] Sent: Wednesday, March 2, 2016 8:29 PM To: user@spark.apache.org Subject: convert SQL multiple

RE: Stage contains task of large size

2016-03-03 Thread Mohammed Guller
Just to elaborate more on what Silvio wrote below, check whether you are referencing a class or object member variable in a function literal/closure passed to one of the RDD methods. Mohammed Author: Big Data Analytics with
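
A sketch of the pattern being described (class and field names are made up):

    class WordFilter extends Serializable {
      val threshold = 10   // member variable

      // referencing the member captures `this`, so the whole WordFilter object
      // is serialized into every task and inflates the task size
      def bad(rdd: org.apache.spark.rdd.RDD[Int]) = rdd.filter(_ > threshold)

      // copying into a local val first means the closure only carries the Int
      def good(rdd: org.apache.spark.rdd.RDD[Int]) = {
        val t = threshold
        rdd.filter(_ > t)
      }
    }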

Re: Spark 1.5 on Mesos

2016-03-03 Thread Tim Chen
Ah I see, I think it's because you've launched the Mesos slave in a docker container, and when you also launch the executor in a container it's not able to mount the sandbox into the other container since the slave is in a chroot. Can you try mounting in a volume from the host when you launch

spark master ui to proxy app and worker ui

2016-03-03 Thread Gurvinder Singh
Hi, I am wondering if it is possible for the spark standalone master UI to proxy app/driver UI and worker UI. The reason for this is that currently if you want to access UI of driver and worker to see logs, you need to have access to their IP:port which makes it harder to open up from networking

Re: Spark SQL Json Parse

2016-03-03 Thread Michael Segel
Why do you want to write out NULL if the column has no data? Just insert the fields that you have. > On Mar 3, 2016, at 9:10 AM, barisak wrote: > > Hi, > > I have a problem with Json Parser. I am using spark streaming with > hiveContext for keeping json format

Re: SFTP Compressed CSV into Dataframe

2016-03-03 Thread Benjamin Kim
Sumedh, How would this work? The only server that we have is the Oozie server with no resources to run anything except Oozie, and we have no sudo permissions. If we run the mount command using the shell action which can run on any node of the cluster via YARN, then the spark job will not be

Re: [Proposal] Enabling time series analysis on spark metrics

2016-03-03 Thread Karan Kumar
Precisely. Found a JIRA in this regard : SPARK-10610 On Wed, Mar 2, 2016 at 3:36 AM, Reynold Xin wrote: > Is the suggestion just to use a different config (and maybe fallback to > appid) in order to publish metrics? Seems

Re: Spark sql query taking long time

2016-03-03 Thread Gourav Sengupta
Hi, using dataframes you can use SQL, and SQL has an option of JOIN, BETWEEN, IN and LIKE OPERATIONS. Why would someone use a dataframe and then use them as RDD's? :) Regards, Gourav Sengupta On Thu, Mar 3, 2016 at 4:28 PM, Sumedh Wale wrote: > On Thursday 03 March 2016

Re: Spark sql query taking long time

2016-03-03 Thread Sumedh Wale
On Thursday 03 March 2016 09:15 PM, Gourav Sengupta wrote: Hi, why not read the table into a dataframe directly using SPARK CSV package. You are trying to solve the problem the

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
One more thing -- just to set aside any question about my specific schema or data, I used the sample schema and data record from Oracle's documentation on Avro support. It's a pretty simple schema: https://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/jsonbinding-overview.html When I

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
It's a write-once table, mainly used for a read/query intensive application. We in fact generate a comma separated string from an array and store it in a single column qualifier. I will look into the approach you suggested. Reading of this table is via Spark. It's an analytic application which loads hbase table

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
No, the name of the field is *enum1* -- the name of the field's type is *enum1_values*. It should not be looking for enum1_values -- that's not the way the specification states that the standard works, and it's not how any other implementation reads Avro data. For what it's worth, if I change

Re: Serializing collections in Datasets

2016-03-03 Thread Daniel Siegmann
I have confirmed this is fixed in Spark 1.6.1 RC 1. Thanks. On Tue, Feb 23, 2016 at 1:32 PM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Yes, I will test once 1.6.1 RC1 is released. Thanks. > > On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust > wrote: > >> I

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
Hi Ted, I'd say about 70th percentile keys have 2 columns each having a string of 20k comma separated values. Top few hundred row keys have about 100-700k comma separated values for those keys. I know that's an extra FAT table. Yes I can remove "hConf.setBoolean("hbase.cluster.distributed",

Re: Using Spark SQL / Hive on AWS EMR

2016-03-03 Thread Gourav Sengupta
Hi, Why are you trying to load data into HIVE and then access it via hiveContext? (by the way hiveContext tables are not visible in the sqlContext). Please read the data directly into a SPARK dataframe and then register it as a temp table to run queries on it. Regards, Gourav On Thu, Mar 3,
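
A minimal sketch of that suggestion (path, format and table name are placeholders):

    val df = sqlContext.read.parquet("s3://my-bucket/events/")   // or json / spark-csv, whatever the source format is
    df.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").show()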

Re: Spark sql query taking long time

2016-03-03 Thread Gourav Sengupta
Hi, why not read the table into a dataframe directly using SPARK CSV package. You are trying to solve the problem the round about way. Regards, Gourav Sengupta On Thu, Mar 3, 2016 at 12:33 PM, Sumedh Wale wrote: > On Thursday 03 March 2016 11:03 AM, Angel Angel wrote: >

Spark SQL Json Parse

2016-03-03 Thread barisak
Hi, I have a problem with the Json Parser. I am using spark streaming with hiveContext for keeping json format tweets. The flume agent collects tweets and sinks them to an hdfs path. My spark streaming job checks the hdfs path, converts the incoming json tweets and inserts them into a hive table. My problem is this: Some
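
For context, a minimal sketch of reading such JSON with Spark SQL (path is a placeholder); the reader infers a union schema across records, and fields missing from a given tweet come back as null:

    val tweets = sqlContext.read.json("hdfs:///user/flume/tweets/")
    tweets.printSchema()                 // optional fields show up as nullable columns
    tweets.registerTempTable("tweets")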

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Ted Yu
bq. hConf.setBoolean("hbase.cluster.distributed", true) Not sure why the above is needed. If hbase-site.xml is on the classpath, it should contain the above setting already. FYI On Thu, Mar 3, 2016 at 6:08 AM, Ted Yu wrote: > From the log snippet you posted, it was not

mapWithState not compacting removed state

2016-03-03 Thread Iain Cundy
Hi All I'm aggregating data using mapWithState with a timeout set in 1.6.0. It broadly works well and by providing access to the key and the time in the callback allows a much more elegant solution for time based aggregation than the old updateStateByKey function. However there seems to be a
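
For readers unfamiliar with the API, a minimal Spark 1.6 sketch (source, types and the 30-minute timeout are made-up placeholders):

    import org.apache.spark.streaming._

    val ssc = new StreamingContext(sc, Seconds(30))
    ssc.checkpoint("/tmp/mapwithstate-checkpoint")   // mapWithState requires checkpointing

    // running sum per key; the state carries the aggregate, the timeout drops idle keys
    def trackSum(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
      val sum = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      if (!state.isTimingOut()) state.update(sum)    // a state that is timing out cannot be updated
      (key, sum)
    }

    val keyed = ssc.socketTextStream("localhost", 9999).map(w => (w, 1L))
    val sums = keyed.mapWithState(StateSpec.function(trackSum _).timeout(Minutes(30)))
    sums.print()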

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Ted Yu
From the log snippet you posted, it was not clear why connection got lost. You can lower the value for caching and see if GC activity gets lower. How wide are the rows in hbase table ? Thanks > On Mar 3, 2016, at 1:01 AM, Nirav Patel wrote: > > so why does

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-03 Thread Joshua Sorrell
Thank you, Jules, for your in depth answer. And thanks, everyone else, for the additional info. This was very helpful. I think for proof of concept, we'll go with pyspark for dev speed. Then we'll reevaluate from there. Any timeline for when GraphX will have python support? On Wed, Mar 2, 2016

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Igor Berman
your field name is *enum1_values* but you have data { "foo1": "test123", *"enum1"*: "BLUE" } i.e. since you defined enum and not union(null, enum) it tries to find value for enum1_values and doesn't find one... On 3 March 2016 at 11:30, Chris Miller wrote: > I've been
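
To spell out what union(null, enum) means here, the field declaration would look roughly like this (the symbol list is a made-up placeholder):

    { "name": "enum1",
      "type": ["null", { "type": "enum", "name": "enum1_values", "symbols": ["BLUE", "RED"] }],
      "default": null }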

Using Spark SQL / Hive on AWS EMR

2016-03-03 Thread Afshartous, Nick
Hi, On AWS EMR 4.2 / Spark 1.5.2, I tried the example here https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#hive-tables to load data from a file into a Hive table. scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> sqlContext.sql("CREATE
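
For reference, the example on that page looks roughly like this (the data file path is the one shipped with the Spark distribution):

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
    sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)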

Re: Spark sql query taking long time

2016-03-03 Thread Sumedh Wale
On Thursday 03 March 2016 11:03 AM, Angel Angel wrote: Hello Sir/Madam, I am writing one application using spark sql. I made the very big table using the following command: val

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-03-03 Thread Lior Chaga
No reference. I opened a ticket about missing documentation for it, and was answered by Sean Owen that this is not meant for spark users. I explained that it's an issue, but no news so far. As for the memory management, I'm not experienced with it, but I suggest you read:

OutOfMemoryError after changing to .persist(StorageLevel.MEMORY_ONLY_SER)

2016-03-03 Thread Jake Yoon
Hi, Spark users. I am getting the following OutOfMemoryError: Java heap space after changing to StorageLevel.MEMORY_ONLY_SER. MEMORY_AND_DISK_SER also throws the same error. I thought the DISK option should put blocks that don't fit in memory onto disk. What could cause the OOM in such a situation? Is there any

Spark mllib k-means taking too much time

2016-03-03 Thread Priya Ch
Hi Team, I am running k-means algorithm on KDD 1999 data set ( http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). I am running the algorithm for different values of k as such - 5,10,15,40. The data set is 709 MB. I have placed the file in hdfs with a block size of 128MB (6 blocks).
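
For context, a minimal MLlib sketch (path is a placeholder, and it assumes the KDD file has already been reduced to purely numeric, comma-separated features):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val raw = sc.textFile("hdfs:///data/kddcup99_numeric.csv")
    // caching matters: without it every k-means iteration re-reads and re-parses the file
    val points = raw.map(line => Vectors.dense(line.split(',').map(_.toDouble))).cache()

    val model = KMeans.train(points, 15, 20)   // k = 15, maxIterations = 20
    println("WSSSE = " + model.computeCost(points))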

Re: building a package with sbt failing on unresolved dependency

2016-03-03 Thread Mich Talebzadeh
Resolved. Had to force fresh downloads again. Went to home directory ~ and deleted the following sub-directories:

rm -rf .sbt
rm -rf .ivy2

and reran sbt project Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Add Jars to Master/Worker classpath

2016-03-03 Thread Matthias Niehoff
Hi, the driver and executor path does not work because it's for the driver and executor, not for the master and worker JVMs. It works fine for driver/executor but we want to add classes to the master/worker. The SPARK_DIST_CLASSPATH looks good, will try this! Thanks! 2016-03-02 18:35 GMT+01:00
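
For reference, a sketch of the SPARK_DIST_CLASSPATH approach (the jar directory is a placeholder; this has to be set on every master and worker node before the daemons are started):

    # conf/spark-env.sh
    export SPARK_DIST_CLASSPATH="/opt/extra-jars/*:$SPARK_DIST_CLASSPATH"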

Reducer doesn't operate in parallel

2016-03-03 Thread octavian.ganea
Hi, I've seen in a few cases that when calling a reduce operation, it is executed sequentially rather than in parallel. For example, I have the following code that performs a simple word counting on very big data using hashmaps (instead of (word,1) pairs that would overflow the memory at
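
For comparison, a word count that keeps the merging distributed instead of combining big per-partition hashmaps serially on the driver (paths are placeholders):

    val counts = sc.textFile("hdfs:///data/corpus")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)          // merging runs in parallel, partition by partition

    counts.saveAsTextFile("hdfs:///data/word_counts")
    // if a single driver-side result is really needed, treeReduce/treeAggregate
    // perform the merge in parallel rounds on the executors rather than serially on the driver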

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
I've been digging into this a little deeper. Here's what I've found:

test1.avsc:
{
  "namespace": "com.cmiller",
  "name": "test1",
  "type": "record",
  "fields": [
    { "name":"foo1", "type":"string" }
  ]
}

test2.avsc:
{

building a package with sbt failing on unresolved dependency

2016-03-03 Thread Mich Talebzadeh
I am just doing a sample package build with sbt as described in here using:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"

It worked before but now

How to display the web ui when running Spark on YARN?

2016-03-03 Thread Shady Xu
Hi all, I am running Spark in yarn-client mode, but every time I access the web UI, the browser redirects me to one of the worker nodes and shows nothing. The url looks like http://hadoop-node31.company.com:8088/proxy/application_1453797301246_120264 . I googled a lot and found some possible

Re: Sorting the RDD

2016-03-03 Thread Alex Dzhagriev
Hi Angel, Your x() function returns an Any type, thus there is no Ordering[Any] defined in the scope and it doesn't make sense to define one. Basically it's the same as trying to order java Objects, which don't have any fields. So the problem is with your x() function; make sure it returns something
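
A small illustration of the point:

    val rdd = sc.parallelize(Seq(("a", 3), ("b", 1), ("c", 2)))

    // does not compile: the key function returns Any, and there is no Ordering[Any] in scope
    // rdd.sortBy(t => if (t._2 > 1) t._2 else t._1)

    // works: the key function returns a concrete type (Int) that has an Ordering
    val sorted = rdd.sortBy(_._2)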

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
so why does 'saveAsHadoopDataset' incurs so much memory pressure? Should I try to reduce hbase caching value ? On Wed, Mar 2, 2016 at 7:51 AM, Nirav Patel wrote: > Hi, > > I have a spark jobs that runs on yarn and keeps failing at line where i do : > > > val hConf =

MetadataFetchFailedException: Missing an output location for shuffle 0

2016-03-03 Thread Pierre Villard
Hi, I have set up a spark job and it keeps failing even though I tried a lot of different configurations regarding memory parameters (as suggested in other threads I read). My configuration: Cluster of 4 machines: 4 vCPU, 16 GB RAM. YARN version: 2.7.1 Spark version: 1.5.2 I tried a lot of

Re: Spark executor killed without apparent reason

2016-03-03 Thread Saisai Shao
If it is due to heartbeat problem and driver explicitly killed the executors, there should be some driver logs mentioned about it. So you could check the driver log about it. Also container (executor) logs are useful, if this container is killed, then there'll be some signal related logs, like

Re: Spark executor killed without apparent reason

2016-03-03 Thread Nirav Patel
There was nothing in the nodemanager logs that indicated why the container was killed. Here's the guess: since killed executors were experiencing high GC activity (full GC) before death, they most likely failed to respond to heartbeats to the driver or nodemanager and got killed because of it. This is more