Build Failure

2015-10-08 Thread shahid qadri
Hi, I tried to build the latest master branch of Spark with build/mvn -DskipTests clean package Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [03:46 min] [INFO] Spark Project Test Tags SUCCESS [01:02 min] [INFO] Spark Project Laun

Re: Build Failure

2015-10-08 Thread Jean-Baptiste Onofré
Hi, I just tried and it works for me (I don't have any Maven mirror on my subnet). Can you try again? Maybe it was a temporary issue accessing Maven Central. The artifact is present on central: http://repo1.maven.org/maven2/com/twitter/algebird-core_2.10/0.9.0/ Regards JB On 10/08/20

Re: Build Failure

2015-10-08 Thread shahid ashraf
Yes, this was a temporary issue. Build Success [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 10.791 s] [INFO] Spark Project Test Tags SUCCESS [ 3.743 s] [INFO] Spark Project Launcher . SUCC

Re: Build Failure

2015-10-08 Thread Jean-Baptiste Onofré
Cool. Glad your build succeeded ;) Regards JB On 10/08/2015 10:25 AM, shahid ashraf wrote: Yes, this was a temporary issue. Build Success [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 10.791 s] [INFO] Spark Project Test Tags

sql query orc slow

2015-10-08 Thread patcharee
Hi, I am using Spark SQL 1.5 to query a Hive table stored as partitioned ORC files. We have about 6000 files in total and each file is about 245 MB. What is the difference between the two query methods below? 1. Using a query on the Hive table directly hiveContext.sql("select col1,

Spark ganglia ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink

2015-10-08 Thread gtanguy
I built Spark with Ganglia: $SPARK_HOME/build/sbt -Pspark-ganglia-lgpl -Phadoop-1 -Phive -Phive-thriftserver assembly ... [info] Including from cache: metrics-ganglia-3.1.0.jar ... In the master log: ERROR actor.OneForOneStrategy: org.apache.spark.metrics.sink.GangliaSink

How can I read a file from HDFS in SparkR from RStudio

2015-10-08 Thread Amit Behera
Hi All, I am very new to SparkR. I am able to run sample code from the example given in this link: http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/ Then I am trying to read a file from HDFS in RStudio, but I am unable to read it. Below is my code. *Sy

Re: Spark Streaming: Doing operation in Receiver vs RDD

2015-10-08 Thread Iulian Dragoș
You can have a look at http://spark.apache.org/docs/latest/streaming-programming-guide.html#receiver-reliability for details on Receiver reliability. If you go the receiver way you'll need to enable Write Ahead Logs to ensure no data loss. In Kafka direct you don't have this problem. Regarding whe
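As a reference point for the write ahead log suggestion above, here is a minimal Scala sketch of enabling it; the app name, batch interval, and checkpoint path are placeholder values, not anything from the thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical configuration: the WAL makes receiver-based input replayable after a driver failure.
val conf = new SparkConf()
  .setAppName("receiver-wal-sketch")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))
// The WAL lives under the checkpoint directory, so it must be on a fault-tolerant filesystem.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")
```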

Re: Is coalesce smart while merging partitions?

2015-10-08 Thread Iulian Dragoș
It's smart. Have a look at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L123 On Thu, Oct 8, 2015 at 4:00 AM, Cesar Flores wrote: > It is my understanding that the default behavior of coalesce function when > the user reduce the number of
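For readers following along, a small sketch of the behaviour under discussion: coalesce merges existing partitions without a shuffle by default, while repartition always shuffles.

```scala
// Assume sc is an existing SparkContext (e.g. in spark-shell).
val rdd = sc.parallelize(1 to 1000000, 100)

// Merges the 100 existing partitions down to 10 by grouping them; no shuffle is performed.
val merged = rdd.coalesce(10)

// Forces a full shuffle and redistributes the data evenly across 10 partitions.
val reshuffled = rdd.repartition(10)

println(merged.partitions.length)      // 10
println(reshuffled.partitions.length)  // 10
```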

Re: Optimal way to avoid processing null returns in Spark Scala

2015-10-08 Thread Iulian Dragoș
On Wed, Oct 7, 2015 at 6:42 PM, swetha wrote: Hi, > > I have the following functions that I am using for my job in Scala. If you > see the getSessionId function I am returning null sometimes. If I return > null the only way that I can avoid processing those records is by filtering > out null reco

Best practises to clean up RDDs for old applications

2015-10-08 Thread Jens Rantil
Hi, I have a couple of old application RDDs under /var/lib/spark/rdd that haven't been properly cleaned up after themselves. Example: # du -shx /var/lib/spark/rdd/* 44K /var/lib/spark/rdd/liblz4-java1011984124691611873.so 48K /var/lib/spark/rdd/snappy-1.0.5-libsnappyjava.so 2.3G /var/lib/spark/rd

Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan
Hello, Can anyone point me to a good example of updateStateByKey with an initial RDD? I am seeing a compile time error when following the API. Regards, Bryan Jeffrey

Spark 1.5.1 standalone cluster - wrong Akka remoting config?

2015-10-08 Thread baraky
Doing my first steps with Spark, I'm facing problems submitting jobs to the cluster from application code. Digging through the logs, I noticed some periodic WARN messages in the master log: 15/10/08 13:00:00 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.16

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Aniket Bhatnagar
Here is an example: val interval = 60 * 1000 val counts = eventsStream.map(event => { (event.timestamp - event.timestamp % interval, event) }).updateStateByKey[Long](updateFunc = (events: Seq[Event], prevStateOpt: Option[Long]) => { val prevCount = prevStateOpt.getOrElse(0L) val newCount = p

Spark 1.5.1 standalone cluster - wrong Akka remoting config?

2015-10-08 Thread Barak Yaish
Doing my first steps with Spark, I'm facing problems submitting jobs to the cluster from application code. Digging through the logs, I noticed some periodic WARN messages in the master log: 15/10/08 13:00:00 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.16

Re: Spark 1.5.1 standalone cluster - wrong Akka remoting config?

2015-10-08 Thread michal.klo...@gmail.com
Try setting spark.driver.host to the actual IP or hostname of the box submitting the work. More info in the networking section of this link: http://spark.apache.org/docs/latest/configuration.html Also check the Spark config for your application for these driver settings in the application web UI a
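A sketch of that suggestion in application code, assuming the submitting box is reachable at 192.168.1.10 and the master at spark://spark-master:7077 (both placeholder values):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical values: point spark.driver.host at the address the cluster can route back to.
val conf = new SparkConf()
  .setAppName("submit-from-app-code")
  .setMaster("spark://spark-master:7077")
  .set("spark.driver.host", "192.168.1.10") // IP or hostname of the machine running the driver
  .set("spark.driver.port", "7001")         // optional: pin the port if a firewall is in the way
val sc = new SparkContext(conf)
```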

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan Jeffrey
Aniket, Thank you for the example - but that's not quite what I'm looking for. I've got a call to updateStateByKey that looks like the following: dstream.map(x => (x.keyWithTime, x)) .updateStateByKey(ProbabilityCalculator.updateCountsOfProcessGivenRole) def updateCountsOfProcessGivenRole(a : Se

Launching EC2 instances with Spark compiled for Scala 2.11

2015-10-08 Thread Theodore Vasiloudis
Hello, I was wondering if there is an easy way launch EC2 instances which have a Spark built for Scala 2.11. The only way I can think of is to prepare the sources for 2.11 as shown in the Spark build instructions ( http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211), u

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Aniket Bhatnagar
Ohh I see. You would have to add an underscore after ProbabilityCalculator.updateCountsOfProcessGivenRole. Try: dstream.map(x => (x.keyWithTime, x)) .updateStateByKey(ProbabilityCalculator.updateCountsOfProcessGivenRole _, new HashPartitioner(3), initialProcessGivenRoleRdd) Here is an example: def c
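For reference, a self-contained sketch of that overload with simplified placeholder types: the trailing underscore turns the method into a function value, and the third argument is the initial state RDD.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical update function: running count per key, seeded from the initial RDD below.
def updateCounts(events: Seq[Long], prevState: Option[Long]): Option[Long] =
  Some(prevState.getOrElse(0L) + events.size)

// Assume eventsStream is a DStream[(String, Long)] and initialRdd holds the seed counts per key.
def withInitialState(eventsStream: DStream[(String, Long)],
                     initialRdd: RDD[(String, Long)]): DStream[(String, Long)] =
  eventsStream.updateStateByKey(updateCounts _, new HashPartitioner(3), initialRdd)
```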

Re: Launching EC2 instances with Spark compiled for Scala 2.11

2015-10-08 Thread Aniket Bhatnagar
Is it possible for you to use EMR instead of EC2? If so, you may be able to tweak EMR bootstrap scripts to install your custom spark build. Thanks, Aniket On Thu, Oct 8, 2015 at 5:58 PM Theodore Vasiloudis < theodoros.vasilou...@gmail.com> wrote: > Hello, > > I was wondering if there is an easy

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan Jeffrey
Honestly, that's what I already did - I am working to test it now. It looked like 'add an underscore' was ignoring some implicit argument that I was failing to provide. On Thu, Oct 8, 2015 at 8:34 AM, Aniket Bhatnagar wrote: > Ohh I see. You would have to add an underscore > after ProbabilityCalcu

Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Maheshakya Wijewardena
Hi, Suppose there is a data frame called goods with columns "barcode" and "items". Some of the values in the column "items" can be null. I want to get the barcode and the respective items from the table, adhering to the following rules: - If "items" is null -> output 0 - else -> output "items" ( the
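For what it's worth, two equivalent ways to express that rule, sketched under the assumption that sqlContext is the context from this thread and goods is registered as a temp table:

```scala
import org.apache.spark.sql.functions.{col, when}

// SQL form: emit 0 when items is null, otherwise the items value.
val viaSql = sqlContext.sql(
  "SELECT barcode, IF(items IS NULL, 0, items) AS items FROM goods")

// Equivalent DataFrame form using when/otherwise.
val viaDf = sqlContext.table("goods").select(
  col("barcode"),
  when(col("items").isNull, 0).otherwise(col("items")).as("items"))
```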

Re: Is coalesce smart while merging partitions?

2015-10-08 Thread Daniel Darabos
> For example does spark try to merge the small partitions first or the election of partitions to merge is random? It is quite smart as Iulian has pointed out. But it does not try to merge small partitions first. Spark doesn't know the size of partitions. (The partitions are represented as Iterato

RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi all, would this be a bug?? val ws = Window. partitionBy("clrty_id"). orderBy("filemonth_dtt") val nm = "repeatMe" df.select(df.col("*"), rowNumber().over(ws).cast("int").as(nm)) stacked_data.filter(stacked_data("repeatMe").isNotNull).or

Re: does KafkaCluster can be public ?

2015-10-08 Thread Cody Koeninger
If anyone is interested in keeping tabs on it, the jira for this is https://issues.apache.org/jira/browse/SPARK-10963 On Wed, Oct 7, 2015 at 3:16 AM, Erwan ALLAIN wrote: > Thanks guys ! > > On Wed, Oct 7, 2015 at 1:41 AM, Cody Koeninger wrote: > >> Sure no prob. >> >> On Tue, Oct 6, 2015 at 6:

Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Tracewski, Lukasz
Hi, Many people interpret this slide from Databricks https://ogirardot.files.wordpress.com/2015/05/future-of-spark.png as an indication that the DataFrames API is going to be the main processing unit of Spark and the sole access point to MLlib, Streaming and such. Is it true? My impression was that Datafra

JDBC thrift server

2015-10-08 Thread Younes Naguib
Hi, We've been using the JDBC thrift server for a couple of weeks now and running queries on it like a regular RDBMS. We're about to deploy it in a shared production cluster. Any advice or warnings on such a setup? YARN or Mesos? How about dynamic resource allocation in an already running thrift ser

How to register udf with Any or generic Type in spark

2015-10-08 Thread dugasani jcreddy
Hi, I have a requirement to use a UDF whose return type is not known in Spark DataFrame SQL. I have the following requirement: the function takes either String or Boolean data types. The function returns 1 or 0 based on whether the input argument is true or false. If the input string is in the form of an integer or long, ret

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-08 Thread Sun, Rui
Can you extract the spark-submit command from the console output, and run it on the Shell, and see if there is any error message? From: Khandeshi, Ami [mailto:ami.khande...@fmr.com] Sent: Wednesday, October 7, 2015 9:57 PM To: Sun, Rui; Hossein Cc: akhandeshi; user@spark.apache.org Subject: RE: S

Re: Parquet file size

2015-10-08 Thread Cheng Lian
How many tasks are there in the write job? Since each task may write one file for each partition, you may end up with taskNum * 31 files. Increasing SPLIT_MINSIZE does help reduce the task number. Another way to address this issue is to use DataFrame.coalesce(n) to shrink the task number to n explic
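A sketch of the coalesce suggestion, assuming df is the DataFrame being written and that the table is partitioned by a date-like column; the path, partition column, and n below are placeholders:

```scala
// Shrink to n write tasks so each partition directory ends up with at most n files
// instead of one file per original task; n = 4 is an arbitrary example value.
val n = 4
df.coalesce(n)
  .write
  .partitionBy("date")                  // hypothetical partition column
  .parquet("hdfs:///warehouse/mytable") // hypothetical output path
```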

Re: JDBC thrift server

2015-10-08 Thread Sathish Kumaran Vairavelu
Which version of Spark are you using? You might encounter SPARK-6882 if Kerberos is enabled. -Sathish On Thu, Oct 8, 2015 at 10:46 AM Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi, > > > > We've been using the JDBC thrift server f

Re: sql query orc slow

2015-10-08 Thread Zhan Zhang
Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee wrote: > Hi, > > I am using Spark SQL 1.5 to query a Hive table stored as partitioned ORC > files. We have about 6000 files in total and each file is about
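For anyone checking this on their side: in Spark 1.5 the ORC predicate pushdown is off by default and is controlled by a SQLConf flag. A minimal sketch, with placeholder table and column names:

```scala
// Enable ORC predicate pushdown (disabled by default in Spark 1.5).
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

// Filters in subsequent queries can then be pushed down into the ORC reader.
val result = hiveContext.sql(
  "SELECT col1, col2 FROM orc_table WHERE date_col = '2015-10-08' AND col1 > 100")
```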

RE: JDBC thrift server

2015-10-08 Thread Younes Naguib
Sorry, we’re running 1.5.1. y From: Sathish Kumaran Vairavelu [mailto:vsathishkuma...@gmail.com] Sent: October-08-15 12:39 PM To: Younes Naguib; user@spark.apache.org Subject: Re: JDBC thrift server Which version of Spark are you using? You might encounter SPARK-6882

Error executing using alternating least square

2015-10-08 Thread haridass saisriram
Hi, I downloaded spark 1.5.0 on windows 7 and built it using build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package and tried running the Alternating least square example ( http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html ) using spark-shell im

Using Spark SQL mapping over an RDD

2015-10-08 Thread Afshartous, Nick
Hi, I am using Spark 1.5 in the latest EMR 4.1. I have an RDD of String scala> deviceIds res25: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[18] at map at :28 and then when trying to map over the RDD while attempting to run a sql query the result is a NullPointerException scala

Re: Using Spark SQL mapping over an RDD

2015-10-08 Thread Michael Armbrust
You can't do nested operations on RDDs or DataFrames (i.e. you can't create a DataFrame from within a map function). Perhaps if you explain what you are trying to accomplish someone can suggest another way. On Thu, Oct 8, 2015 at 10:10 AM, Afshartous, Nick wrote: > > Hi, > > Am using Spark, 1.5

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Michael Armbrust
Hmm, that looks like it should work to me. What version of Spark? What is the schema of goods? On Thu, Oct 8, 2015 at 6:13 AM, Maheshakya Wijewardena wrote: > Hi, > > Suppose there is data frame called goods with columns "barcode" and > "items". Some of the values in the column "items" can be

Re: Default size of a datatype in SparkSQL

2015-10-08 Thread Michael Armbrust
It's purely for estimation, when guessing whether it's safe to do a broadcast join. We picked a random number that we thought was larger than the common case (it's better to overestimate to avoid OOM). On Wed, Oct 7, 2015 at 10:11 PM, vivek bhaskar wrote: > I want to understand whats use of default
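That estimate feeds the broadcast-join decision, which is governed by spark.sql.autoBroadcastJoinThreshold (10 MB by default). A sketch of adjusting it; the 50 MB value is just an example:

```scala
// Raise the threshold to ~50 MB: tables whose estimated size is below it are broadcast
// to every executor instead of being shuffle-joined.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// Setting it to -1 disables broadcast joins entirely.
// sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
```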

RE: Using Spark SQL mapping over an RDD

2015-10-08 Thread Afshartous, Nick
> You can't do nested operations on RDDs or DataFrames (i.e. you can't create a > DataFrame from within a map function). Perhaps if you explain what you are > trying to accomplish someone can suggest another way. The code below is what I had in mind. For each Id, I'd like to run a query using t

Applicative logs on Yarn

2015-10-08 Thread nibiau
Hello, I submit spark streaming inside Yarn, I have configured yarn to generate custom logs. It works fine and yarn aggregate very well the logs inside HDFS, nevertheless the log files are only usable via "yarn logs" command. I would prefer to be able to navigate inside via hdfs command like a te

Re: Using Spark SQL mapping over an RDD

2015-10-08 Thread Michael Armbrust
You are probably looking for a groupby instead: sqlContext.sql("SELECT COUNT(*) FROM ad_info GROUP BY deviceId") On Thu, Oct 8, 2015 at 10:27 AM, Afshartous, Nick wrote: > > > You can't do nested operations on RDDs or DataFrames (i.e. you can't > create a DataFrame from within a map function).
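In other words, one grouped query over the whole table replaces the per-id queries attempted inside map(). A sketch assuming an ad_info table with a deviceId column, as in the thread:

```scala
// Single aggregation over the whole table.
val countsPerDevice = sqlContext.sql(
  "SELECT deviceId, COUNT(*) AS cnt FROM ad_info GROUP BY deviceId")

// Equivalent DataFrame form.
val countsDf = sqlContext.table("ad_info").groupBy("deviceId").count()
```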

Re: How to register udf with Any or generic Type in spark

2015-10-08 Thread Michael Armbrust
You can't do this with UDFs. All data types have to be known ahead of time. On Thu, Oct 8, 2015 at 8:59 AM, dugasani jcreddy < jcredd...@yahoo.com.invalid> wrote: > Hi, > > I have a requirement to use udf whose return type is not known in spark > data frame sql. > I have below requirement >
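A sketch of the constraint: a UDF is registered with one concrete signature, so a common workaround is to accept the value as a String and always return an Int. The function name and conversion rules here are hypothetical, not from the original post:

```scala
// Hypothetical UDF: maps "true"/"false" (or numeric strings) to 1/0 with a fixed Int return type.
sqlContext.udf.register("toFlag", (s: String) => {
  val v = Option(s).map(_.trim.toLowerCase).getOrElse("")
  if (v == "true") 1
  else if (v == "false") 0
  else scala.util.Try(v.toLong).map(n => if (n != 0L) 1 else 0).getOrElse(0)
})

// Usage in SQL: sqlContext.sql("SELECT toFlag(some_col) FROM some_table")
```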

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Michael Armbrust
Don't worry, the ability to work with domain objects and lambda functions is not going to go away. However, we are looking at ways to leverage Tungsten's improved performance when processing structured data. More details can be found here: https://issues.apache.org/jira/browse/SPARK- On Thu,

Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
Which version of Spark? On Thu, Oct 8, 2015 at 7:25 AM, wrote: > Hi all, would this be a bug?? > > val ws = Window. > partitionBy("clrty_id"). > orderBy("filemonth_dtt") > > val nm = "repeatMe" > df.select(df.col("*"), rowNumber().over(ws).cast("in

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi, thanks for looking into it. v1.5.1. I am really worried. I don't have Hive/Hadoop for real in the environment. Saif From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, October 08, 2015 2:57 PM To: Ellafi, Saif A. Cc: user Subject: Re: RowNumber in HiveContext returns null or ne

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
It turns out this does not happen in local[32] mode. It only happens when submitting to the standalone cluster. Don't have YARN/Mesos to compare. Will keep diagnosing. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October 08, 2015 3:01 PM To: mich...@data

Re: Applicative logs on Yarn

2015-10-08 Thread Ted Yu
This question seems better suited for u...@hadoop.apache.org FYI On Thu, Oct 8, 2015 at 10:37 AM, wrote: > Hello, > I submit spark streaming inside Yarn, I have configured yarn to generate > custom logs. > It works fine and yarn aggregate very well the logs inside HDFS, > nevertheless the log f

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Jerry Lam
I just read the article by ogirardot, but I don't agree. It is like saying the pandas DataFrame is the sole data structure for analyzing data in Python. Can a pandas DataFrame replace a NumPy array? The answer is simply no, from an efficiency perspective, for some computations. Unless there is a computer s

How to increase Spark partitions for the DataFrame?

2015-10-08 Thread unk1102
Hi, I have the following code where I read ORC files from HDFS and it loads a directory which contains 12 ORC files. Since the HDFS directory contains 12 files it will create 12 partitions by default. This directory is huge and when the ORC files get decompressed it becomes around 10 GB. How do I increas

failed spark job reports on YARN as successful

2015-10-08 Thread Lan Jiang
Hi there, I have a Spark batch job running on CDH5.4 + Spark 1.3.0. The job is submitted in “yarn-client” mode. The job itself failed because YARN killed several executor containers that exceeded the memory limit imposed by YARN. However, when I went to the YARN resource manager site,

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
The partition number should be the same as the HDFS block number instead of the file number. Did you confirm from the Spark UI that only 12 partitions were created? What is your ORC orc.stripe.size? Lan > On Oct 8, 2015, at 1:13 PM, unk1102 wrote: > > Hi I have the following code where I read

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Repartition and default parallelism to 1, in cluster mode, is still broken. So the problem is not the parallelism, but the cluster mode itself. Something wrong with HiveContext + cluster mode. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October

Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-08 Thread unk1102
Hi, as recommended I am caching my Spark job dataframe as dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER), but what I see in the Spark job UI is that this persist stage runs for so long, showing 10 GB of shuffle read and 5 GB of shuffle write; it takes too long to finish and because of that sometimes my Spa

Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
Can you open a JIRA? On Thu, Oct 8, 2015 at 11:24 AM, wrote: > Repartition and default parallelism to 1, in cluster mode, is still > *broken*. > > > > So the problem is not the parallelism, but the cluster mode itself. > Something wrong with HiveContext + cluster mode. > > > > Saif > > > > *From

Insert via HiveContext is slow

2015-10-08 Thread Daniel Haviv
Hi, I'm inserting into a partitioned ORC table using an insert sql statement passed via HiveContext. The performance I'm getting is pretty bad and I was wondering if there are ways to speed things up. Would saving the DF like this df.write().mode(SaveMode.Append).partitionBy("date").saveAsTable("Ta

Re: Insert via HiveContext is slow

2015-10-08 Thread Daniel Haviv
Forgot to mention that my insert is a multi table insert : sqlContext2.sql("""from avro_events lateral view explode(usChnlList) usParamLine as usParamLine lateral view explode(dsChnlList) dsParamLine as dsParamLine insert into table UpStreamParam partiti

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Umesh Kacha
Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the UI that it has only 12 partitions because of 12 HDFS blocks, and the Hive ORC file stripe size is 33554432. On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang wrote: > The partition number should be the same as the HDFS block number instead >

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
Hmm, that’s odd. You can always use repartition(n) to increase the partition number, but then there will be shuffle. How large is your ORC file? Have you used NameNode UI to check how many HDFS blocks each ORC file has? Lan > On Oct 8, 2015, at 2:08 PM, Umesh Kacha wrote: > > Hi Lan, thank

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Umesh Kacha
Hi Lan, thanks for the reply. I have tried to do the following but it did not increase partitions: DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/").repartition(100); Yes, I have checked in the NameNode UI: the ORC directory contains 12 files/blocks of 128 MB each and ORC files whe

Re: Spark REST Job server feedback?

2015-10-08 Thread Tim Smith
I am curious too - any comparison between the two. Looks like one is Datastax sponsored and the other is Cloudera. Other than that, any major/core differences in design/approach? Thanks, Tim On Mon, Sep 28, 2015 at 8:32 AM, Ramirez Quetzal wrote: > Anyone has feedback on using Hue / Spark Job

Streaming DirectKafka assertion errors

2015-10-08 Thread Roman Garcia
I'm running Spark Streaming using Kafka Direct stream, expecting exactly-once semantics using checkpoints (which are stored onto HDFS). My Job is really simple, it opens 6 Kafka streams (6 topics with 4 parts each) and stores every row to ElasticSearch using ES-Spark integration. This job was work

ValueError: can not serialize object larger than 2G

2015-10-08 Thread XIANDI
File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main process() File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/hadoop/spark/python/pyspark/serializers.py", line 126, in du

Re: Streaming DirectKafka assertion errors

2015-10-08 Thread Cody Koeninger
It sounds like you moved the job from one environment to another? This may sound silly, but make sure (eg using lsof) the brokers the job is connecting to are actually the ones you expect. As far as the checkpoint goes, the log output should indicate whether the job is restoring from checkpoint.

Re: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Ted Yu
See the comment of FramedSerializer() in serializers.py : Serializer that writes objects as a stream of (length, data) pairs, where C{length} is a 32-bit integer and data is C{length} bytes. Hence the limit on the size of object. On Thu, Oct 8, 2015 at 12:56 PM, XIANDI wrote: > File

Re: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Ted Yu
To fix the problem, consider increasing number of partitions for your job. Showing code snippet would help us understand your use case better. Cheers On Thu, Oct 8, 2015 at 1:39 PM, Ted Yu wrote: > See the comment of FramedSerializer() in serializers.py : > > Serializer that writes objects

FW: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Xiandi Zhang
def parse_record(x):
    formatted = list(x[0])
    if (type(x[1]) != list) & (type(x[1]) != tuple):
        formatted.append(x[1])
    else:
        formatted.extend(

unsubscribe

2015-10-08 Thread Jürgen Fey

Re: unsubscribe

2015-10-08 Thread Ted Yu
Take a look at the first section of: http://spark.apache.org/community On Thu, Oct 8, 2015 at 2:10 PM, Jürgen Fey wrote: > >

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Ted Yu
bq. contains 12 files/blocks Looks like you hit the limit of parallelism these files can provide. If you had a larger dataset, you would have more partitions. On Thu, Oct 8, 2015 at 12:21 PM, Umesh Kacha wrote: > Hi Lan, thanks for the reply. I have tried to do the following but it did > not inc

Unexplained sleep time

2015-10-08 Thread yael aharon
Hello, I am working on improving the performance of our Spark on Yarn applications. Scanning through the logs I found the following lines: [2015-10-07T16:25:17.245-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:yarn] Started progress reporter thread - sleep

Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread Tian Zhang
I hit this issue with spark 1.3.0 stateful application (with updateStateByKey) function on mesos. It will fail after running fine for about 24 hours. The error stack trace as below, I checked ulimit -n and we have very large numbers set on the machines. What else can be wrong? 15/09/27 18:45:11 W

Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread DB Tsai
Try running this to see the actual ulimit. We found that Mesos overrides the ulimit, which causes the issue. import sys.process._ val p = 1 to 100 val rdd = sc.parallelize(p, 100) val a = rdd.map(x=> Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect Sincerely, DB Tsai

Architecture for a Spark batch job.

2015-10-08 Thread Renato Perini
I have started a project using Spark 1.5.1 consisting of several jobs I launch (actually manually) using shell scripts against a small Spark standalone cluster. Those jobs generally read a Cassandra table (using a RDD of type JavaRDD or using plain DataFrames), compute results on that data and

Re: Spark Streaming: Doing operation in Receiver vs RDD

2015-10-08 Thread Tathagata Das
Since it is about encryption and decryption, it's also good to know how the raw data is actually saved to disk. If the write ahead log is enabled, then the raw data will be saved to the WAL in HDFS. You probably do not want to save decrypted data in that. So it's better not to decrypt in the receiver, a

RE: Insert via HiveContext is slow

2015-10-08 Thread Cheng, Hao
I think that's a known performance issue (compared to Hive) of Spark SQL in multi-inserts. A workaround is to create a cached temp table for the projection first, and then do the multiple inserts based on the cached table. We are actually working on a POC of some similar cases; hopefully it comes
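A rough Scala sketch of that workaround against the tables mentioned in this thread; the selected column, partition column, and partition value in the insert are placeholders, and sqlContext is assumed to be a HiveContext:

```scala
// Materialize the expensive exploded projection once and cache it...
val exploded = sqlContext.sql(
  """SELECT * FROM avro_events
    |LATERAL VIEW explode(usChnlList) usParamLine AS usParamLine
    |LATERAL VIEW explode(dsChnlList) dsParamLine AS dsParamLine""".stripMargin)
exploded.registerTempTable("exploded_events")
sqlContext.cacheTable("exploded_events")

// ...then run each insert against the cached table instead of re-reading the source.
sqlContext.sql(
  """INSERT INTO TABLE UpStreamParam PARTITION (load_date='2015-10-08')
    |SELECT usParamLine FROM exploded_events""".stripMargin)
```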

[Spark 1.5] Kinesis receivers not starting

2015-10-08 Thread Bharath Mukkati
Hi Spark Users, I am testing my application on Spark 1.5 and kinesis-asl-1.5. The streaming application starts but I see a ton of stages scheduled for ReceiverTracker (submitJob at ReceiverTracker.scala:557). In the driver logs I see this sequence repeat: 15/10/09 00:10:54 INFO

Re: [Spark 1.5] Kinesis receivers not starting

2015-10-08 Thread Tathagata Das
How many executors and cores do you acquire? td On Thu, Oct 8, 2015 at 6:11 PM, Bharath Mukkati wrote: > Hi Spark Users, > > I am testing my application on Spark 1.5 and kinesis-asl-1.5. The > streaming application starts but I see a ton of stages scheduled for > ReceiverTracker (submitJob at R

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Maheshakya Wijewardena
Spark version: 1.4.1 The schema is "barcode STRING, items INT" On Thu, Oct 8, 2015 at 10:48 PM, Michael Armbrust wrote: > Hmm, that looks like it should work to me. What version of Spark? What > is the schema of goods? > > On Thu, Oct 8, 2015 at 6:13 AM, Maheshakya Wijewardena < > mahesha...@w

Error in load hbase on spark

2015-10-08 Thread Roy Wang
I want to load an HBase table into Spark. JavaPairRDD hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class); When calling hBaseRDD.count(), I got an error. Caused by: java.lang.IllegalStateException: The input format instance has not been properly initiali

Re: Error in load hbase on spark

2015-10-08 Thread Ted Yu
One possibility is that the HBase config, including hbase.zookeeper.quorum, was not passed to your job. hbase-site.xml should be on the classpath. Can you show a snippet of your code? It looks like you were running against HBase 1.x Cheers On Thu, Oct 8, 2015 at 7:29 PM, Roy Wang wrote: > > I want t

Re: Streaming DirectKafka assertion errors

2015-10-08 Thread Roman Garcia
Thanks Cody for your help. Actually I found out it was an issue on our network. After doing a ping from the Spark node to the Kafka node I found there were duplicate packets. After rebooting the Kafka node everything went back to normal! Thanks for your help! Roman On Thu., October 8, 2015 at 17:13, Cody Koen

Re: Re: Error in load hbase on spark

2015-10-08 Thread Ted Yu
The second code snippet is similar to: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala See the comment in HBaseTest.scala: // please ensure HBASE_CONF_DIR is on classpath of spark driver // e.g: set it through spark.driver.extraClassPath property // in spark-default

RE: How can I read a file from HDFS in SparkR from RStudio

2015-10-08 Thread Sun, Rui
Amit, sqlContext <- sparkRSQL.init(sc) peopleDF <- read.df(sqlContext, "hdfs://master:9000/sears/example.csv") have you restarted the R session in RStudio between the two lines? From: Amit Behera [mailto:amit.bd...@gmail.com] Sent: Thursday, October 8, 2015 5:59 PM To: user@spark.apache.org Sub

error in sparkSQL 1.5 using count(1) in nested queries

2015-10-08 Thread Jeff Thompson
After upgrading from 1.4.1 to 1.5.1 I found some of my Spark SQL queries no longer worked. It seems to be related to using count(1) or count(*) in a nested query. I can reproduce the issue in a pyspark shell with the sample code below. The ‘people’ table is from spark-1.5.1-bin-hadoop2.4/examples/

weird issue with sqlContext.createDataFrame - pyspark 1.3.1

2015-10-08 Thread ping yan
I really cannot figure out what this is about.. (tried to import pandas, in case that is a dependency, but it didn't help.) >>> from pyspark.sql import SQLContext >>> sqlContext=SQLContext(sc) >>> sqlContext.createDataFrame(l).collect() Traceback (most recent call last): File "", line 1, in F

Fixed writer version as version1 for Parquet when writing a Parquet file.

2015-10-08 Thread Hyukjin Kwon
Hi all, While writing some Parquet files with Spark, I found it actually only writes the Parquet files with writer version1. This changes the encoding types of the file. Is this intentionally fixed for some reason? I changed the code and tested writing this as writer version2 and it looks fine. In more d

Re: sql query orc slow

2015-10-08 Thread patcharee
Yes, the predicate pushdown is enabled, but it still takes longer than the first method. BR, Patcharee On 08. okt. 2015 18:43, Zhan Zhang wrote: Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee wrote: Hi,