Re: What is the difference between forEachAsync vs forEachPartitionAsync?

2017-04-02 Thread kant kodali
Wait, RDD operations should in fact execute in parallel, right? So if I call rdd.forEachAsync, that should execute in parallel, shouldn't it? I guess I am a little confused about what the difference really is between forEachAsync vs forEachPartitionAsync, besides passing in a Tuple vs an Iterator of Tuples to the
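A minimal Scala sketch of the point in question (assuming an existing RDD named rdd; the Scala spellings are foreachAsync/foreachPartitionAsync): the async variants still run their tasks in parallel across the cluster; what changes is that the driver gets back a FutureAction instead of blocking until the action completes.

    import org.apache.spark.FutureAction
    import scala.concurrent.Await
    import scala.concurrent.duration._

    // foreachAsync returns immediately; its tasks are still distributed
    // across partitions and executed in parallel, like any other action.
    val f: FutureAction[Unit] = rdd.foreachAsync(x => println(x))

    // The driver is free to do other work here ...
    Await.result(f, 10.minutes) // block only when the result is needed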

What is the difference between forEachAsync vs forEachPartitionAsync?

2017-04-02 Thread kant kodali
Hi all, What is the difference between forEachAsync vs forEachPartitionAsync? I couldn't find any comments in the Javadoc. If I were to guess, here is what I would say, but please correct me if I am wrong: forEachAsync just iterates through values from all partitions one by one in an async manner
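A hedged sketch of the difference, assuming an RDD[(String, Int)] named rdd and a hypothetical helper openConnection: foreachAsync applies the function to each element, while foreachPartitionAsync hands the function an Iterator over a whole partition, which is the natural place for per-partition setup. Both return a FutureAction instead of blocking the driver.

    // Per element: the function sees one tuple at a time.
    rdd.foreachAsync { case (k, v) => println(s"$k -> $v") }

    // Per partition: the function sees an Iterator over the partition's
    // tuples, so expensive setup runs once per partition, not per element.
    rdd.foreachPartitionAsync { it =>
      val conn = openConnection() // hypothetical helper, e.g. a DB connection
      it.foreach { case (k, v) => conn.write(k, v) }
      conn.close()
    }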

org.apache.spark.sql.AnalysisException: resolved attribute(s) code#906 missing from code#1992,

2017-04-02 Thread grjohnson35
The exception org.apache.spark.sql.AnalysisException: resolved attribute(s) code#906 missing from code#1992 is being thrown on a DataFrame. When I print the schema, the DataFrame contains the field. Any help is much appreciated. val spark = SparkSession.builder()
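This error is often (though not in every case) a self-join symptom: two plans derived from the same DataFrame carry a `code` attribute with different internal IDs (code#906 vs code#1992), and the analyzer cannot resolve one against the other. A minimal Scala sketch of the usual alias workaround, with hypothetical data since the thread does not show the full query:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("alias-workaround").getOrCreate()
    import spark.implicits._

    val base = Seq((906, "a"), (1992, "b")).toDF("code", "value")

    // Joining a DataFrame with a derived copy of itself can leave two `code`
    // attributes with different IDs; aliasing each side disambiguates them.
    val l = base.alias("l")
    val r = base.filter($"code" > 1000).alias("r")
    val joined = l.join(r, $"l.code" === $"r.code")
      .select($"l.value", $"r.code")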

Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-02 Thread Irving Duran
Thanks for the share! Thank You, Irving Duran On Sun, Apr 2, 2017 at 7:19 PM, Felix Cheung wrote: > Interesting! > > -- > *From:* Robert Yokota > *Sent:* Sunday, April 2, 2017 9:40:07 AM > *To:* user@spark.apache.org

Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-02 Thread Felix Cheung
Interesting! From: Robert Yokota Sent: Sunday, April 2, 2017 9:40:07 AM To: user@spark.apache.org Subject: Graph Analytics on HBase with HGraphDB and Spark GraphFrames Hi, In case anyone is interested in analyzing graphs in HBase with Apache

Re: Looking at EMR Logs

2017-04-02 Thread Paul Tremblay
Thanks. That seems to work great, except that EMR doesn't always copy the logs to S3. The behavior seems inconsistent, and I am debugging it now. On Fri, Mar 31, 2017 at 7:46 AM, Vadim Semenov wrote: > You can provide your own log directory, where Spark log will be
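A sketch of the suggestion quoted above, assuming an S3 location you control (the bucket name is hypothetical): point the Spark event log at your own directory, so you are not dependent on EMR's periodic copy of container logs.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("emr-logs")
      .config("spark.eventLog.enabled", "true")
      // Write event logs directly to a bucket you own; EMRFS resolves s3://.
      .config("spark.eventLog.dir", "s3://my-bucket/spark-event-logs/")
      .getOrCreate()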

Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-02 Thread Robert Yokota
Hi, In case anyone is interested in analyzing graphs in HBase with Apache Spark GraphFrames, this might be helpful: https://yokota.blog/2017/04/02/graph-analytics-on-hbase-with-hgraphdb-and-spark-graphframes/

Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-04-02 Thread Sathish Kumaran Vairavelu
Please let me know if anybody has any thoughts on this issue. On Thu, Mar 30, 2017 at 10:37 PM Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com> wrote: > Also, is it possible to cache the logical plan and parsed query so that in > subsequent executions it can be reused? It would improve overall
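One partial workaround, under the assumption that the query text is stable between executions and that a SparkSession named spark already exists: keep a reference to the Dataset rather than re-issuing spark.sql(...) each time. Parsing and analysis happen once per Dataset (its QueryExecution caches the analyzed and executed plans in lazy vals), so repeated actions on the same reference skip re-planning. The query below is hypothetical.

    // Build the plan once ...
    val report = spark.sql("SELECT dept, avg(salary) FROM emp GROUP BY dept")
    report.collect() // parsed, analyzed, and planned here
    report.collect() // reuses the cached plans; only execution repeats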

Re: Update DF record with delta data in spark

2017-04-02 Thread Jörn Franke
If you trust that your delta file is correct, then this might be the way forward. You just have to keep in mind that sometimes you can have several delta files in parallel, and you need to apply them in the correct order, or otherwise a deleted row might reappear again. Things get more messy if a
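A minimal sketch of the ordering point, assuming delta files are named so that lexicographic order matches arrival order, and assuming a base DataFrame baseDf and a merge routine applyDelta (both hypothetical; applyDelta could be the full-outer-join merge sketched under the original question below): fold the deltas over the base in order, so the latest change wins.

    // Hypothetical paths; sorting relies on the naming convention above.
    val deltaPaths = Seq(
      "s3://bucket/delta-0002.csv",
      "s3://bucket/delta-0001.csv"
    ).sorted

    val result = deltaPaths.foldLeft(baseDf) { (acc, path) =>
      applyDelta(acc, spark.read.option("header", "true").csv(path))
    }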

Represent documents as a sequence of wordID & frequency and perform PCA

2017-04-02 Thread Old-School
Imagine that 4 documents exist as shown below:

D1: the cat sat on the mat
D2: the cat sat on the cat
D3: the cat sat
D4: the mat sat

where each word in the vocabulary can be translated to its wordID:

0 the
1 cat
2 sat
3 on
4 mat

Now every document can be represented using sparse vectors
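A sketch of this representation in Scala with Spark ML, assuming the 5-word vocabulary above (wordIDs 0 through 4): each document becomes a sparse vector of (wordID, frequency) pairs, and PCA can then run directly on the features column.

    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("doc-pca").getOrCreate()

    // One sparse vector of (wordID, frequency) per document, vocab size 5.
    val docs = spark.createDataFrame(Seq(
      Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0))), // D1
      Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0))),           // D2
      Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0))),                     // D3
      Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0)))                      // D4
    ).map(Tuple1.apply)).toDF("features")

    // Project the 5-dimensional term space onto 2 principal components.
    val pca = new PCA().setInputCol("features").setOutputCol("pca").setK(2).fit(docs)
    pca.transform(docs).select("pca").show(false)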

Update DF record with delta data in spark

2017-04-02 Thread Selvam Raman
Hi,

Table 1 (old file):
name   number  salary
Test1  1       1
Test2  2       1

Table 2 (delta file):
name   number  salary
Test1  1       4
Test3  3       2

I do not have a date stamp field in this table; the composite key is the name and number fields.

Expected result:
name   number  salary
Test1  1       4
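A minimal Scala sketch of one way to get the expected result, assuming a single delta file and that the delta always wins on the composite (name, number) key: full-outer-join the two tables and coalesce the salary columns, preferring the delta's value.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.coalesce

    val spark = SparkSession.builder().appName("delta-merge").getOrCreate()
    import spark.implicits._

    val old   = Seq(("Test1", 1, 1), ("Test2", 2, 1)).toDF("name", "number", "salary")
    val delta = Seq(("Test1", 1, 4), ("Test3", 3, 2)).toDF("name", "number", "salary")

    // Full outer join on the composite key; the delta's salary wins if present.
    val merged = old.alias("o")
      .join(delta.alias("d"), Seq("name", "number"), "full_outer")
      .select($"name", $"number", coalesce($"d.salary", $"o.salary").alias("salary"))

    merged.show() // Test1 1 4, Test2 2 1, Test3 3 2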

Does Apache Spark use any Dependency Injection framework?

2017-04-02 Thread kant kodali
Hi All, I am wondering if I can get a SparkConf object through Dependency Injection? I currently use the HOCON library to store all key/value pairs required to
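As far as I know, Spark itself does not ship a DI framework, so there is nothing built in to inject SparkConf. A minimal sketch of wiring the HOCON file into SparkConf by hand with the Typesafe Config library the poster already uses (the "spark" section name in application.conf is an assumption):

    import com.typesafe.config.ConfigFactory
    import org.apache.spark.SparkConf
    import scala.collection.JavaConverters._

    // application.conf (hypothetical):
    //   spark { master = "local[*]", app.name = "my-app" }
    val hocon = ConfigFactory.load().getConfig("spark")

    // Copy every entry under the "spark" section into a SparkConf.
    val sparkConf = new SparkConf()
    hocon.entrySet().asScala.foreach { e =>
      sparkConf.set("spark." + e.getKey, e.getValue.unwrapped.toString)
    }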

Re: strange behavior of spark 2.1.0

2017-04-02 Thread Jiang Jacky
Thank you for replying. Actually, there is no message coming in during the exception, and there is no OOME in any executor. What I am suspecting is that it might be caused by the WAL. > On Apr 2, 2017, at 5:22 AM, Timur Shenkao wrote: > > Hello, > It's difficult to tell without

Re: Partitioning strategy

2017-04-02 Thread Jörn Franke
You can always repartition, but maybe for your use case different RDDs with the same data but different partitioning strategies could make sense. It may also make sense to choose an appropriate format on disc (ORC, Parquet). You have to choose based also on the users' non-functional requirements.
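For instance, a sketch of the on-disc idea, assuming a SparkSession named spark with its implicits imported, a DataFrame df, and a hypothetical schema with year/month columns: persist as Parquet partitioned by the time columns, so a user's later month/year selection prunes directories instead of scanning all four years.

    // Persist once, partitioned by the columns users filter on.
    df.write
      .partitionBy("year", "month")
      .parquet("/data/events_by_time")

    // A later month selection reads only the matching directories.
    val march2016 = spark.read.parquet("/data/events_by_time")
      .filter($"year" === 2016 && $"month" === 3)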

Partitioning strategy

2017-04-02 Thread jasbir.sing
Hi, I have an RDD with 4 years' data, with, suppose, 20 partitions. At runtime, the user can decide to select a few months or years of the RDD. That means, based upon the user's time selection, the RDD is filtered, and on the filtered RDD further transformations and actions are performed. And, as Spark says, a child RDD

Re: strange behavior of spark 2.1.0

2017-04-02 Thread Timur Shenkao
Hello, It's difficult to tell without details. I believe one of the executors dies because of OOM or some runtime exception (some unforeseen dirty data row). Less probable is a GC stop-the-world pause when the incoming message rate increases drastically. On Saturday, April 1, 2017, Jiang Jacky

read binary file in PySpark

2017-04-02 Thread Yogesh Vyas
Hi, I am trying to read a binary file in PySpark using the API binaryRecords(path, recordLength), but it is giving all values as ['\x00', '\x00', '\x00', '\x00',]. But when I try to read the same file using binaryFiles(), it gives me the correct RDD, but in the form of key-value pairs. The value
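A sketch of both APIs in Scala (the same methods exist on PySpark's SparkContext); one common cause of all-zero output from binaryRecords is a recordLength that does not match the file's actual fixed record size, so the splits land on padding or misaligned bytes. The path and length below are hypothetical.

    // Fixed-length records: recordLength must equal the file's exact record size.
    val records = sc.binaryRecords("/data/readings.bin", 16)
    // records: RDD[Array[Byte]], one 16-byte array per record

    // Whole files: an RDD of (path, stream) pairs; map away the key for bytes.
    val files = sc.binaryFiles("/data/*.bin")
    val bytes = files.map { case (_, stream) => stream.toArray() }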