Null array of cols

2017-10-24 Thread Mohit Anchlia
I am trying to understand the best way to handle the scenario where a null array "[]" is passed. Can somebody suggest if there is a way to filter out such records? I've tried numerous things, including using dataframe.head().isEmpty, but pyspark doesn't recognize isEmpty even though I see it in the

Spark directory partition name

2017-10-16 Thread Mohit Anchlia
When Spark writes a partition it writes it in the format <key name>=<key value>. Is there a way to have Spark write only the key value?

Re: Write Ahead Log

2016-06-08 Thread Mohit Anchlia
applies to Streaming. > > On Wed, Jun 8, 2016 at 3:14 PM, Mohit Anchlia <mohitanch...@gmail.com> > wrote: > >> Is something similar to spark.streaming.receiver.writeAheadLog.enable >> available on SparkContext? It looks like it only works for spark streaming. >> > >

Write Ahead Log

2016-06-08 Thread Mohit Anchlia
Is something similar to spark.streaming.receiver.writeAheadLog.enable available on SparkContext? It looks like it only works for Spark Streaming.

Re: Dealing with failures

2016-06-08 Thread Mohit Anchlia
On Wed, Jun 8, 2016 at 3:42 AM, Jacek Laskowski <ja...@japila.pl> wrote: > On Wed, Jun 8, 2016 at 2:38 AM, Mohit Anchlia <mohitanch...@gmail.com> > wrote: > > I am looking to write an ETL job using spark that reads data from the > > source, per

Dealing with failures

2016-06-07 Thread Mohit Anchlia
I am looking to write an ETL job using Spark that reads data from the source, performs transformations, and inserts it into the destination. I am trying to understand how Spark deals with failures; I can't seem to find the documentation. I am interested in learning about the following scenarios: 1. Source

Re: Too many files/dirs in hdfs

2015-08-25 Thread Mohit Anchlia
, which seems like an extra overhead from a maintenance and IO perspective. On Mon, Aug 24, 2015 at 2:51 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Any help would be appreciated On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia mohitanch...@gmail.com wrote: My question was how to do

Re: Too many files/dirs in hdfs

2015-08-24 Thread Mohit Anchlia
Any help would be appreciated On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia mohitanch...@gmail.com wrote: My question was how to do this in Hadoop? Could somebody point me to some examples? On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Of course, Java

Re: Too many files/dirs in hdfs

2015-08-19 Thread Mohit Anchlia
after applying your filters 3) Write this StringBuilder to File when you want to write (The duration can be defined as a condition) On Tue, Aug 18, 2015 at 11:05 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a way to store all the results in one file and keep the file roll over

Re: Too many files/dirs in hdfs

2015-08-18 Thread Mohit Anchlia
-to-single-file-on-hdfs-td21124.html#a21167 or have a separate program which will do the clean up for you. Thanks Best Regards On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Spark stream seems to be creating 0 bytes files even when there is no data. Also, I have

Too many files/dirs in hdfs

2015-08-14 Thread Mohit Anchlia
Spark Streaming seems to be creating 0-byte files even when there is no data. Also, I have 2 concerns here: 1) Extra unnecessary files are being created from the output 2) Hadoop doesn't work really well with too many files, and I see that it is creating a directory with a timestamp every 1 second.

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Mohit Anchlia
Anchlia mohitanch...@gmail.com wrote: I was able to get this working by using an alternative method however I only see 0 bytes files in hadoop. I've verified that the output does exist in the logs however it's missing from hdfs. On Thu, Aug 13, 2015 at 10:49 AM, Mohit Anchlia mohitanch

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Mohit Anchlia
: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F]) { Did you intend to use outputPath as prefix ? Cheers On Fri, Aug 14, 2015 at 1:36 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Spark 1.3 Code

Executors on multiple nodes

2015-08-14 Thread Mohit Anchlia
I am running on Yarn and do have a question on how spark runs executors on different data nodes. Is that primarily decided based on number of receivers? What do I need to do to ensure that multiple nodes are being used for data processing?

Spark RuntimeException hadoop output format

2015-08-13 Thread Mohit Anchlia
I have this call trying to save to hdfs 2.6 wordCounts.saveAsNewAPIHadoopFiles(prefix, txt); but I am getting the following: java.lang.RuntimeException: class scala.runtime.Nothing$ not org.apache.hadoop.mapreduce.OutputFormat

Re: Spark RuntimeException hadoop output format

2015-08-13 Thread Mohit Anchlia
I was able to get this working by using an alternative method however I only see 0 bytes files in hadoop. I've verified that the output does exist in the logs however it's missing from hdfs. On Thu, Aug 13, 2015 at 10:49 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I have this call trying

Unit Testing

2015-08-12 Thread Mohit Anchlia
Is there a way to run Spark Streaming methods in a standalone Eclipse environment to test out the functionality?

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
. Best Ayan On Wed, Aug 12, 2015 at 9:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How does partitioning in spark work when it comes to streaming? What's the best way to partition a time series data grouped by a certain tag like categories of product video, music etc. -- Best

Re: ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
After changing the '--deploy_mode client' the program seems to work however it looks like there is a bug in spark when using --deploy_mode as 'yarn'. Should I open a bug? On Tue, Aug 11, 2015 at 3:02 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I see the following line in the log 15/08/11

Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
How does partitioning in Spark work when it comes to streaming? What's the best way to partition time-series data grouped by a certain tag, like categories of product (video, music, etc.)?

Re: ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
am using it in yarn On Tue, Aug 11, 2015 at 1:52 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am seeing following error. I think it's not able to find some other associated classes as I see $2 in the exception, but not sure what I am missing. 15/08/11 16:00:15 WARN

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
during the batchInterval are partitions of the RDD. Now if you want to repartition based on a key, a shuffle is needed. On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How does partitioning in spark work when it comes to streaming? What's the best way to partition

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
: If you are running on a cluster, the listening is occurring on one of the executors, not in the driver. On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to run this program as a yarn-client. The job seems to be submitting successfully however I don't

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
), and submit your app in client mode on that to see whether you are seeing the process listening on or not. On Mon, Aug 10, 2015 at 12:43 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I've verified all the executors and I don't see a process listening on the port. However, the application

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I do see this message: 15/08/10 19:19:12 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources On Mon, Aug 10, 2015 at 4:15 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am using the same

Checkpoint Dir Error in Yarn

2015-08-07 Thread Mohit Anchlia
I am running in yarn-client mode and trying to execute network word count example. When I connect through nc I see the following in spark app logs: Exception in thread main java.lang.AssertionError: assertion failed: The checkpoint directory has not been set. Please use

SparkR

2015-07-27 Thread Mohit Anchlia
Does SparkR support all the algorithms that R library supports?

Re: Class incompatible error

2015-04-09 Thread Mohit Anchlia
experience with mixed JDK's. Can you try with using single JDK ? Cheers On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: For the build I am using java version 1.7.0_65 which seems to be the same as the one on the spark host. However one is Oracle and the other

start-slave.sh not starting

2015-04-08 Thread Mohit Anchlia
I am trying to start the worker with: sbin/start-slave.sh spark://ip-10-241-251-232:7077 In the logs it's complaining: Master must be a URL of the form spark://hostname:port I also have this in spark-defaults.conf: spark.master spark://ip-10-241-251-232:7077 Did I miss
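A likely explanation, hedged since the thread is cut off: in the Spark 1.x scripts of this era, start-slave.sh took a worker number before the master URL, so passing only the URL makes the script treat it as the worker number and reject the (missing) master argument. Note also that spark-defaults.conf is read by spark-submit for applications, not by the worker launch scripts. The hostname below is the one from the message.

```shell
# Spark 1.3-era invocation: worker number first, then the master URL
sbin/start-slave.sh 1 spark://ip-10-241-251-232:7077
```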

Class incompatible error

2015-04-08 Thread Mohit Anchlia
I am seeing the following, is this because of my maven version? 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-241-251-232.us-west-2.compute.internal): java.io.InvalidClassException: org.apache.spark.Aggregator; local class incompatible: stream classdesc

Re: Class incompatible error

2015-04-08 Thread Mohit Anchlia
? Cheers On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am seeing the following, is this because of my maven version? 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-241-251-232.us-west-2.compute.internal

Re: WordCount example

2015-04-06 Thread Mohit Anchlia
Interesting, I see 0 cores in the UI? - *Cores:* 0 Total, 0 Used On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das t...@databricks.com wrote: What does the Spark Standalone UI at port 8080 say about number of cores? On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia mohitanch...@gmail.com wrote

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
and Connected to host port which shows that receiver is correctly connected to nc process. Thanks Jerry 2015-03-27 8:45 GMT+08:00 Mohit Anchlia mohitanch...@gmail.com: What's the best way to troubleshoot inside spark to see why Spark is not connecting to nc on port ? I don't see any

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
wrote: How many cores are present in the works allocated to the standalone cluster spark://ip-10-241-251-232:7077 ? On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com wrote: If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this seems to work. I don't

Re: WordCount example

2015-03-30 Thread Mohit Anchlia
I tried to file a bug in git repo however I don't see a link to open issues On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I checked the ports using netstat and don't see any connections established on that port. Logs show only this: 15/03/27 13:50:48 INFO

Re: WordCount example

2015-03-27 Thread Mohit Anchlia
the executor's log to see if there's log like Connecting to host port and Connected to host port which shows that receiver is correctly connected to nc process. Thanks Jerry 2015-03-27 8:45 GMT+08:00 Mohit Anchlia mohitanch...@gmail.com: What's the best way to troubleshoot inside spark to see

Re: WordCount example

2015-03-26 Thread Mohit Anchlia
What's the best way to troubleshoot inside spark to see why Spark is not connecting to nc on port ? I don't see any errors either. On Thu, Mar 26, 2015 at 2:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to run the word count example but for some reason it's not working

WordCount example

2015-03-26 Thread Mohit Anchlia
I am trying to run the word count example but for some reason it's not working as expected. I start the nc server on port and then submit the Spark job to the cluster. The Spark job gets successfully submitted but I never see any connection from Spark getting established. I also tried to type words
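The troubleshooting steps suggested later in the thread boil down to checking the connection from the executor side, since on a cluster the receiver (not the driver) opens the socket. A sketch, using port 9999 as an assumption because the original message elides the port:

```shell
# terminal 1: start the line-oriented source BEFORE submitting the streaming job
nc -lk 9999

# terminal 2, on the host where the receiver executor runs, after submission:
# an ESTABLISHED entry here means the receiver connected to nc
netstat -an | grep 9999
```

If nothing is listed, the executor logs are the next place to look for "Connecting to host:port" / "Connected to host:port" lines.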

akka.version error

2015-03-24 Thread Mohit Anchlia
I am facing the same issue as listed here: http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-td5615.html The solution mentioned is here: https://gist.github.com/prb/d776a47bd164f704eecb However, I think I don't understand a few things: 1) Why are the jars being split

Re: Spark streaming alerting

2015-03-23 Thread Mohit Anchlia
(Errors : + rdd.count())) And the alert() function could be anything triggering an email or sending an SMS alert. Thanks Best Regards On Sun, Mar 22, 2015 at 1:52 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a module in spark streaming that lets you listen to the alerts

Re: Load balancing

2015-03-22 Thread Mohit Anchlia
the life cycle of the receiver is, but I don't think sth actually happens when you create the DStream. My guess would be that the receiver is allocated when you call StreamingContext#startStreams(), Regards, Jeff 2015-03-21 21:19 GMT+01:00 Mohit Anchlia mohitanch...@gmail.com: Could

Spark streaming alerting

2015-03-21 Thread Mohit Anchlia
Is there a module in Spark Streaming that lets you listen to alerts/conditions as they happen in the streaming module? Generally Spark Streaming components execute against large clusters like HDFS or Cassandra; however, when it comes to alerting you generally can't send it directly from

Load balancing

2015-03-19 Thread Mohit Anchlia
I am trying to understand how to load balance the incoming data to multiple spark streaming workers. Could somebody help me understand how I can distribute my incoming data from various sources such that incoming data is going to multiple spark streaming nodes? Is it done by spark client with help

Unable to connect

2015-03-13 Thread Mohit Anchlia
I am running Spark Streaming standalone in EC2 and I am trying to run the wordcount example from my desktop. The program is unable to connect to the master; in the logs I see the following, which seems to be an issue with the hostname. 15/03/13 17:37:44 ERROR EndpointWriter: dropping message [class

Re: Partitioning

2015-03-13 Thread Mohit Anchlia
(see docs). But use it at your own risk. If you modify the keys, and yet preserve partitioning, the partitioning would not make sense any more as the hash of the keys have changed. TD On Fri, Mar 13, 2015 at 2:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to look

Compilation error

2015-03-12 Thread Mohit Anchlia
I am trying out the streaming example as documented, and I am using Spark 1.2.1 streaming from Maven for Java. When I add this code I get a compilation error, and Eclipse is not able to recognize Tuple2. I also don't see any import for the scala.Tuple2 class.

Re: Compilation error

2015-03-12 Thread Mohit Anchlia
It works after sync, thanks for the pointers On Tue, Mar 10, 2015 at 1:22 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I navigated to maven dependency and found scala library. I also found Tuple2.class and when I click on it in eclipse I get invalid LOC header (bad signature

Architecture Documentation

2015-03-11 Thread Mohit Anchlia
Is there a good architecture doc that gives a sufficient overview of high level and low level details of spark with some good diagrams?

Re: Compilation error

2015-03-10 Thread Mohit Anchlia
project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize

Compilation error on JavaPairDStream

2015-03-10 Thread Mohit Anchlia
I am getting the following error. When I look at the sources it seems to be a Scala source, but I'm not sure why it's complaining about it. The method map(Function<String,R>) in the type JavaDStream<String> is not applicable for the arguments (new PairFunction<String,String,Integer>(){}) And my code has
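The fix, confirmed later in the thread ("use words.mapToPair" -- the example in the docs had a typo): `map()` cannot take a `PairFunction`, so producing a `JavaPairDStream` requires `mapToPair()`. A sketch of the corrected call, assuming the standard streaming word-count types and spark-streaming on the classpath:

```java
import scala.Tuple2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class PairExample {
    // mapToPair (not map) is the method that accepts a PairFunction and
    // yields a JavaPairDStream suitable for reduceByKey
    static JavaPairDStream<String, Integer> toPairs(JavaDStream<String> words) {
        return words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) {
                return new Tuple2<>(word, 1);
            }
        });
    }
}
```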

Re: Compilation error

2015-03-10 Thread Mohit Anchlia
delete that file from local repo and re-sync On Tue, Mar 10, 2015 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I ran the dependency command and see the following dependencies: I only see org.scala-lang. [INFO] org.spark.test:spak-test:jar:0.0.1-SNAPSHOT [INFO

Re: Compilation error

2015-03-10 Thread Mohit Anchlia
On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so

Re: Compilation error

2015-03-10 Thread Mohit Anchlia
scala.* classes. However, it would be a little bit more correct to depend directly on the scala library classes, but in practice, easiest not to in simple use cases. If you're still having trouble look at the output of mvn dependency:tree On Tue, Mar 10, 2015 at 6:32 PM, Mohit Anchlia mohitanch

Hadoop Map vs Spark stream Map

2015-03-10 Thread Mohit Anchlia
Hi, I am trying to understand the Hadoop map method compared to the Spark map, and I noticed that the Spark map only receives 3 arguments: 1) input value 2) output key 3) output value, whereas the Hadoop map has 4: 1) input key 2) input value 3) output key 4) output value. Is there any reason it was

Re: Compilation error on JavaPairDStream

2015-03-10 Thread Mohit Anchlia
works now. I should have checked :) On Tue, Mar 10, 2015 at 1:44 PM, Sean Owen so...@cloudera.com wrote: Ah, that's a typo in the example: use words.mapToPair I can make a little PR to fix that. On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am getting

SQL with Spark Streaming

2015-03-10 Thread Mohit Anchlia
Does Spark Streaming also support SQL? Something like how Esper does CEP.