Re: RDD Location

2016-12-29 Thread Sun Rui
Maybe you can create your own subclass of RDD and override
getPreferredLocations() to implement the logic for dynamically changing the
locations.
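For example, a minimal (untested) sketch of a wrapper RDD whose preferred
locations come from a caller-supplied function, so they can change over time;
the class name and the locations function are just illustrative:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Delegates partitions and compute to the parent RDD, but answers
// getPreferredLocations from the supplied function every time the
// scheduler asks, so the locations can change dynamically.
class RelocatedRDD[T: ClassTag](prev: RDD[T], locations: Int => Seq[String])
  extends RDD[T](prev) {

  override def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)

  override def getPreferredLocations(split: Partition): Seq[String] =
    locations(split.index)
}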
> On Dec 30, 2016, at 12:06, Fei Hu  wrote:
> 
> Dear all,
> 
> Is there any way to change the host location for a certain partition of RDD?
> 
> "protected def getPreferredLocations(split: Partition)" can be used to 
> initialize the location, but how to change it after the initialization?
> 
> 
> Thanks,
> Fei
> 
> 



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread ayan guha
"If data ingestion speed is faster than data production speed, then
eventually the entire database will be harvested and those workers will
start to "tail" the database for new data streams and the processing
becomes real time."

This part is really database dependent, so it will be hard to generalize
it. For example, say you have a batch interval of 10 secs... what happens
if you get more than one update on the same row within those 10 secs? You will
get a snapshot every 10 secs. Now, different databases provide different
mechanisms to expose all DML changes: MySQL has binlogs, Oracle has log
shipping, CDC, GoldenGate and so on... typically this requires a new product or
new licenses and most likely a new component installation on the production db :)

So, if we keep real CDC solutions out of scope, a simple snapshot solution
can be achieved fairly easily by

1. Adding INSERTED_ON and UPDATED_ON columns on the source table(s).
2. Keeping a simple table-level checkpoint (TABLENAME, TS_MAX).
3. Running an extraction/load mechanism which takes data from the DB (where
INSERTED_ON > TS_MAX or UPDATED_ON > TS_MAX) and puts it into HDFS. This can be
Sqoop, Spark, or an ETL tool like Informatica, ODI, SAP etc. (see the sketch
below). In addition, you can write directly to Kafka as well; Sqoop and Spark
support Kafka, and most of the ETL tools would too...
4. Finally, update the checkpoint...

You may "determine" checkpoint from the data you already have in HDFS if
you create a Hive structure on it.
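A rough, untested sketch of step 3 in Spark; the connection details, table and
column names and the checkpoint value below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("incremental-pull").getOrCreate()

// checkpoint previously stored for this table (TABLENAME, TS_MAX)
val tsMax = "2016-12-29 00:00:00"

// pull only the rows inserted or updated since the checkpoint
val delta = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sourcedb")
  .option("dbtable",
    s"(SELECT * FROM SRC_TABLE WHERE INSERTED_ON > '$tsMax' OR UPDATED_ON > '$tsMax') t")
  .option("user", "etl").option("password", "***")
  .load()

// land the delta on HDFS (or write it to Kafka instead)
delta.write.mode("append").parquet("hdfs:///staging/src_table")

// 4. finally, update the checkpoint with the max(UPDATED_ON) of this batch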

Best
AYan



On Fri, Dec 30, 2016 at 4:45 PM, 任弘迪  wrote:

> why not sync binlog of mysql(hopefully the data is immutable and the table
> is append-only), send the log through kafka and then consume it by spark
> streaming?
>
> On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust 
> wrote:
>
>> We don't support this yet, but I've opened this JIRA as it sounds
>> generally useful: https://issues.apache.org/jira/browse/SPARK-19031
>>
>> In the mean time you could try implementing your own Source, but that is
>> pretty low level and is not yet a stable API.
>>
>> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" 
>> wrote:
>>
>>> Hi all,
>>>
>>> Thanks a lot for your contributions to bring us new technologies.
>>>
>>> I don't want to waste your time, so before I write to you, I googled,
>>> checked stackoverflow and mailing list archive with keywords "streaming"
>>> and "jdbc". But I was not able to get any solution to my use case. I hope I
>>> can get some clarification from you.
>>>
>>> The use case is quite straightforward, I need to harvest a relational
>>> database via jdbc, do something with data, and store result into Kafka. I
>>> am stuck at the first step, and the difficulty is as follows:
>>>
>>> 1. The database is too large to ingest with one thread.
>>> 2. The database is dynamic and time series data comes in constantly.
>>>
>>> Then an ideal workflow is that multiple workers process partitions of
>>> data incrementally according to a time window. For example, the processing
>>> starts from the earliest data with each batch containing data for one hour.
>>> If data ingestion speed is faster than data production speed, then
>>> eventually the entire database will be harvested and those workers will
>>> start to "tail" the database for new data streams and the processing
>>> becomes real time.
>>>
>>> With Spark SQL I can ingest data from a JDBC source with partitions
>>> divided by time windows, but how can I dynamically increment the time
>>> windows during execution? Assume that there are two workers ingesting data
>>> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
>>> for 2017-01-03. But I am not able to find out how to increment those values
>>> during execution.
>>>
>>> Then I looked into Structured Streaming. It looks much more promising
>>> because window operations based on event time are considered during
>>> streaming, which could be the solution to my use case. However, from
>>> documentation and code example I did not find anything related to streaming
>>> data from a growing database. Is there anything I can read to achieve my
>>> goal?
>>>
>>> Any suggestion is highly appreciated. Thank you very much and have a
>>> nice day.
>>>
>>> Best regards,
>>> Yang
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>


-- 
Best Regards,
Ayan Guha


Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread 任弘迪
why not sync binlog of mysql(hopefully the data is immutable and the table
is append-only), send the log through kafka and then consume it by spark
streaming?

On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust 
wrote:

> We don't support this yet, but I've opened this JIRA as it sounds
> generally useful: https://issues.apache.org/jira/browse/SPARK-19031
>
> In the mean time you could try implementing your own Source, but that is
> pretty low level and is not yet a stable API.
>
> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" 
> wrote:
>
>> Hi all,
>>
>> Thanks a lot for your contributions to bring us new technologies.
>>
>> I don't want to waste your time, so before I write to you, I googled,
>> checked stackoverflow and mailing list archive with keywords "streaming"
>> and "jdbc". But I was not able to get any solution to my use case. I hope I
>> can get some clarification from you.
>>
>> The use case is quite straightforward, I need to harvest a relational
>> database via jdbc, do something with data, and store result into Kafka. I
>> am stuck at the first step, and the difficulty is as follows:
>>
>> 1. The database is too large to ingest with one thread.
>> 2. The database is dynamic and time series data comes in constantly.
>>
>> Then an ideal workflow is that multiple workers process partitions of
>> data incrementally according to a time window. For example, the processing
>> starts from the earliest data with each batch containing data for one hour.
>> If data ingestion speed is faster than data production speed, then
>> eventually the entire database will be harvested and those workers will
>> start to "tail" the database for new data streams and the processing
>> becomes real time.
>>
>> With Spark SQL I can ingest data from a JDBC source with partitions
>> divided by time windows, but how can I dynamically increment the time
>> windows during execution? Assume that there are two workers ingesting data
>> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
>> for 2017-01-03. But I am not able to find out how to increment those values
>> during execution.
>>
>> Then I looked into Structured Streaming. It looks much more promising
>> because window operations based on event time are considered during
>> streaming, which could be the solution to my use case. However, from
>> documentation and code example I did not find anything related to streaming
>> data from a growing database. Is there anything I can read to achieve my
>> goal?
>>
>> Any suggestion is highly appreciated. Thank you very much and have a nice
>> day.
>>
>> Best regards,
>> Yang
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Annabel Melongo
Richard,
In the provided documentation, under the paragraph "Schema Merging", you can
actually perform what you want this way:
1. Create a schema that reads the raw json, line by line.
2. Create another schema that reads the json file and structures it in ("id",
"ln", "fn").
3. Merge the two schemas and you'll get what you want.
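Alternatively, if the merge feels heavy, a rough, untested sketch (Scala, Spark
1.6+) that simply keeps the raw line as a column and parses the fields out of
it with get_json_object (hc being the HiveContext from the snippet below):

import org.apache.spark.sql.functions.get_json_object

val raw = hc.read.text("files/json/example2.json").toDF("raw_json")

val df = raw.select(
  get_json_object(raw("raw_json"), "$.id").as("id"),
  get_json_object(raw("raw_json"), "$.ln").as("ln"),
  get_json_object(raw("raw_json"), "$.fn").as("fn"),
  get_json_object(raw("raw_json"), "$.age").as("age"),
  raw("raw_json"))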
Thanks 

On Thursday, December 29, 2016 7:18 PM, Richard Xin 
 wrote:
 

Thanks, I have seen this, but it doesn't cover my question.
What I need is to read the json and include the raw json as part of my dataframe.

On Friday, December 30, 2016 10:23 AM, Annabel Melongo 
 wrote:
 

 Richard,
Below documentation will show you how to create a sparkSession and how to 
programmatically load data:
Spark SQL and DataFrames - Spark 2.1.0 Documentation

  

On Thursday, December 29, 2016 5:16 PM, Richard Xin 
 wrote:
 

Say I have the following data in a file:

{"id":1234,"ln":"Doe","fn":"John","age":25}
{"id":1235,"ln":"Doe","fn":"Jane","age":22}

java code snippet:

    final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    HiveContext hc = new HiveContext(ctx.sc());
    DataFrame df = hc.read().json("files/json/example2.json");

what I need is a DataFrame with columns id, ln, fn, age as well as the raw_json
string.
Any advice on the best practice in Java?

Thanks,
Richard


   

   

   

Re: Best way to process lookup ETL with Dataframes

2016-12-29 Thread ayan guha
How about this -

select a.*, nvl(b.col,nvl(c.col,'some default'))
from driving_table a
left outer join lookup1 b on a.id=b.id
left outer join lookup2 c on a.id=c.id

?
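Or roughly the same thing with the DataFrame API (an untested sketch; driving,
lookup1, lookup2, id and col stand for the frames and columns described below):

import org.apache.spark.sql.functions.{coalesce, lit}

val joined = driving
  .join(lookup1, driving("id") === lookup1("id"), "left_outer")
  .join(lookup2, driving("id") === lookup2("id"), "left_outer")

// take the value from lookup1 if present, else lookup2, else a default
val result = joined.withColumn("resolved",
  coalesce(lookup1("col"), lookup2("col"), lit("some default")))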

On Fri, Dec 30, 2016 at 9:55 AM, Sesterhenn, Mike 
wrote:

> Hi all,
>
>
> I'm writing an ETL process with Spark 1.5, and I was wondering the best
> way to do something.
>
>
> A lot of the fields I am processing require an algorithm similar to this:
>
>
> Join input dataframe to a lookup table.
>
> if (that lookup fails (the joined fields are null)) {
>
> Lookup into some other table to join some other fields.
>
> }
>
>
> With Dataframes, it seems the only way to do this is to do something like
> this:
>
>
> Join input dataframe to a lookup table.
>
> if (that lookup fails (the joined fields are null)) {
>
>*SPLIT the dataframe into two DFs via DataFrame.filter(),
>
>   one group with successful lookup, the other failed).*
>
>For failed lookup:  {
>
>Lookup into some other table to grab some other fields.
>
>}
>
>*MERGE the dataframe splits back together via DataFrame.unionAll().*
> }
>
>
> I'm seeing some really large execution plans as you might imagine in the
> Spark Ui, and the processing time seems way out of proportion with the size
> of the dataset.  (~250GB in 9 hours).
>
>
> Is this the best approach to implement an algorithm like this?  Note also
> that some fields I am implementing require multiple staged split/merge
> steps due to cascading lookup joins.
>
>
> Thanks,
>
>
> *Michael Sesterhenn*
>
>
> *msesterh...@cars.com  *
>
>


-- 
Best Regards,
Ayan Guha


[Spark streaming 1.6.0] Spark streaming with Yarn: executors not fully utilized

2016-12-29 Thread Nishant Kumar
I am running spark streaming with Yarn  -

*spark-submit --master yarn --deploy-mode cluster --num-executors 2
> --executor-memory 8g --driver-memory 2g --executor-cores 8 ..*
>

I am consuming Kafka through the DirectStream approach (no receiver). I have 2
topics (each with 3 partitions).

I repartition the RDD (I have one DStream) into 16 parts (assuming number of
executors * number of cores = 2 * 8 = 16; is that correct?) and then I do
foreachPartition, write each partition to a local file and then send it
to another server (not Spark) through HTTP (using the Apache HTTP sync client
with a pooling manager, via POST with multi-part).
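Roughly, the job looks like this (an untested sketch for Spark 1.6 with the
Kafka direct stream; the broker/topic names and the upload step are
placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-http"), Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topic1", "topic2"))

stream.foreachRDD { rdd =>
  // 2 executors * 8 cores = 16 partitions
  rdd.map(_._2).repartition(16).foreachPartition { records =>
    records.foreach(record => ())   // placeholder: write to a local file, then POST it
  }
}

ssc.start()
ssc.awaitTermination()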

When I checked the details of this step in the Spark UI, it showed that all 16
tasks executed on a single executor, with 8 tasks at a time.

*This is Spark UI details -*

Details for Stage 717 (Attempt 0)
>
> *Index  ID  Attempt Status  Locality Level  Executor ID / Host  Launch
> Time Duration  GC Time Shuffle Read Size / Records Errors*
> 0  5080  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:46 2 s 11 ms 313.3 KB / 6137
> 1  5081  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:46 2 s 11 ms 328.5 KB / 6452
> 2  5082  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:46 2 s 11 ms 324.3 KB / 6364
> 3  5083  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:46 2 s 11 ms 321.5 KB / 6306
> 4  5084  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:46 2 s 11 ms 324.8 KB / 6364
> 5  5085  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:46 2 s 11 ms 320.8 KB / 6307
> 6  5086  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:46 2 s 11 ms 323.4 KB / 6356
> 7  5087  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:46 3 s 11 ms 316.8 KB / 6207
> 8  5088  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:48 2 s   317.7 KB / 6245
> 9  5089  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name  2016/12/27
> 12:11:48 2 s   320.4 KB / 6280
> 10  5090  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/27 12:11:48 2 s   323.0 KB / 6334
> 11  5091  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/27 12:11:48 2 s   323.7 KB / 6371
> 12  5092  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/27 12:11:48 2 s   316.7 KB / 6218
> 13  5093  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/27 12:11:48 2 s   321.0 KB / 6301
> 14  5094  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/27 12:11:48 2 s   321.4 KB / 6304
> 15  5095  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/27 12:11:49 2 s   319.1 KB / 6267
>

I was expecting it to execute 16 parallel tasks (2 executors * 8 cores) across
one or more executors. I think I am missing something. Please help.

*More Info:*

   1. Incoming data is not evenly distributed. E.g. the 1st topic's 2nd
   partition gets 5k * 5 = 25k messages (5k = maxRatePerPartition, 5s = batch
   interval) while the other two partitions sometimes have almost no data. The
   2nd topic has ~500-4000 messages per batch, evenly distributed across 3
   partitions.
   2. When there is no data in topic 1, I see 16 parallel tasks processing
   across 2 executors.


*Index ID  Attempt Status  Locality Level  Executor ID / Host  Launch Time
> Duration  GC Time Shuffle Read Size / Records Errors*
> 0 330402  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/28 04:31:41 1 s   19.2 KB / 193
> 1 330403  0 SUCCESS NODE_LOCAL  2 / executor2_machine_host_name
> 2016/12/28 04:31:41 1 s   21.2 KB / 227
> 2 330404  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/28 04:31:41 1 s   20.8 KB / 214
> 3 330405  0 SUCCESS NODE_LOCAL  2 / executor2_machine_host_name
> 2016/12/28 04:31:41 1 s   20.9 KB / 222
> 4 330406  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/28 04:31:41 2 s   21.0 KB / 222
> 5 330407  0 SUCCESS NODE_LOCAL  2 / executor2_machine_host_name
> 2016/12/28 04:31:41 1 s   20.5 KB / 213
> 6 330408  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/28 04:31:41 1 s   20.4 KB / 207
> 7 330409  0 SUCCESS NODE_LOCAL  2 / executor2_machine_host_name
> 2016/12/28 04:31:41 1 s   19.2 KB / 188
> 8 330410  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/28 04:31:41 1 s   20.4 KB / 214
> 9 330411  0 SUCCESS NODE_LOCAL  2 / executor2_machine_host_name
> 2016/12/28 04:31:41 1 s   20.1 KB / 206
> 10  330412  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/28 04:31:41 0.6 s   18.7 KB / 183
> 11  330413  0 SUCCESS NODE_LOCAL  2 / executor2_machine_host_name
> 2016/12/28 04:31:41 1 s   20.6 KB / 217
> 12  330414  0 SUCCESS NODE_LOCAL  1 / executor1_machine_host_name
> 2016/12/28 04:31:41 1 s   20.0 KB / 206
> 13  330415  0 SUCCESS NODE_LOCAL  2 / executor2_machine_host_name
> 2016/12/28 04:31:41 1 s   20.7 KB / 216
> 14  330416  0 SUCCESS 

RDD Location

2016-12-29 Thread Fei Hu
Dear all,

Is there any way to change the host location for a certain partition of RDD?

"protected def getPreferredLocations(split: Partition)" can be used to
initialize the location, but how to change it after the initialization?


Thanks,
Fei


Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Richard Xin
Thanks, I have seen this, but it doesn't cover my question.
What I need is to read the json and include the raw json as part of my dataframe.

On Friday, December 30, 2016 10:23 AM, Annabel Melongo 
 wrote:
 

 Richard,
Below documentation will show you how to create a sparkSession and how to 
programmatically load data:
Spark SQL and DataFrames - Spark 2.1.0 Documentation

  

On Thursday, December 29, 2016 5:16 PM, Richard Xin 
 wrote:
 

Say I have the following data in a file:

{"id":1234,"ln":"Doe","fn":"John","age":25}
{"id":1235,"ln":"Doe","fn":"Jane","age":22}

java code snippet:

    final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    HiveContext hc = new HiveContext(ctx.sc());
    DataFrame df = hc.read().json("files/json/example2.json");

what I need is a DataFrame with columns id, ln, fn, age as well as the raw_json
string.
Any advice on the best practice in Java?

Thanks,
Richard


   

   

Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Annabel Melongo
Richard,
Below documentation will show you how to create a sparkSession and how to 
programmatically load data:
Spark SQL and DataFrames - Spark 2.1.0 Documentation

  

On Thursday, December 29, 2016 5:16 PM, Richard Xin 
 wrote:
 

Say I have the following data in a file:

{"id":1234,"ln":"Doe","fn":"John","age":25}
{"id":1235,"ln":"Doe","fn":"Jane","age":22}

java code snippet:

    final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    HiveContext hc = new HiveContext(ctx.sc());
    DataFrame df = hc.read().json("files/json/example2.json");

what I need is a DataFrame with columns id, ln, fn, age as well as the raw_json
string.
Any advice on the best practice in Java?

Thanks,
Richard


   

DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Richard Xin
Say I have the following data in a file:

{"id":1234,"ln":"Doe","fn":"John","age":25}
{"id":1235,"ln":"Doe","fn":"Jane","age":22}

java code snippet:

    final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    HiveContext hc = new HiveContext(ctx.sc());
    DataFrame df = hc.read().json("files/json/example2.json");

what I need is a DataFrame with columns id, ln, fn, age as well as the raw_json
string.
Any advice on the best practice in Java?

Thanks,
Richard


Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Michael Armbrust
We don't support this yet, but I've opened this JIRA as it sounds generally
useful: https://issues.apache.org/jira/browse/SPARK-19031

In the mean time you could try implementing your own Source, but that is
pretty low level and is not yet a stable API.

On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" 
wrote:

> Hi all,
>
> Thanks a lot for your contributions to bring us new technologies.
>
> I don't want to waste your time, so before I write to you, I googled,
> checked stackoverflow and mailing list archive with keywords "streaming"
> and "jdbc". But I was not able to get any solution to my use case. I hope I
> can get some clarification from you.
>
> The use case is quite straightforward, I need to harvest a relational
> database via jdbc, do something with data, and store result into Kafka. I
> am stuck at the first step, and the difficulty is as follows:
>
> 1. The database is too large to ingest with one thread.
> 2. The database is dynamic and time series data comes in constantly.
>
> Then an ideal workflow is that multiple workers process partitions of data
> incrementally according to a time window. For example, the processing
> starts from the earliest data with each batch containing data for one hour.
> If data ingestion speed is faster than data production speed, then
> eventually the entire database will be harvested and those workers will
> start to "tail" the database for new data streams and the processing
> becomes real time.
>
> With Spark SQL I can ingest data from a JDBC source with partitions
> divided by time windows, but how can I dynamically increment the time
> windows during execution? Assume that there are two workers ingesting data
> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
> for 2017-01-03. But I am not able to find out how to increment those values
> during execution.
>
> Then I looked into Structured Streaming. It looks much more promising
> because window operations based on event time are considered during
> streaming, which could be the solution to my use case. However, from
> documentation and code example I did not find anything related to streaming
> data from a growing database. Is there anything I can read to achieve my
> goal?
>
> Any suggestion is highly appreciated. Thank you very much and have a nice
> day.
>
> Best regards,
> Yang
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Best way to process lookup ETL with Dataframes

2016-12-29 Thread Sesterhenn, Mike
Hi all,


I'm writing an ETL process with Spark 1.5, and I was wondering the best way to 
do something.


A lot of the fields I am processing require an algorithm similar to this:


Join input dataframe to a lookup table.

if (that lookup fails (the joined fields are null)) {

Lookup into some other table to join some other fields.

}


With Dataframes, it seems the only way to do this is to do something like this:


Join input dataframe to a lookup table.

if (that lookup fails (the joined fields are null)) {

   *SPLIT the dataframe into two DFs via DataFrame.filter(),

  one group with successful lookup, the other failed).*

   For failed lookup:  {

   Lookup into some other table to grab some other fields.

   }

   *MERGE the dataframe splits back together via DataFrame.unionAll().*

}
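In code, that split/merge looks roughly like this (an untested sketch; input,
lookup1, lookup2, id and col stand for the frames and columns above):

val joined = input.join(lookup1, input("id") === lookup1("id"), "left_outer")

// rows where the first lookup succeeded
val hits = joined.filter(lookup1("col").isNotNull)
  .select(input("id"), lookup1("col").as("resolved"))

// rows where it failed: try the second lookup
val misses = joined.filter(lookup1("col").isNull)
  .join(lookup2, input("id") === lookup2("id"), "left_outer")
  .select(input("id"), lookup2("col").as("resolved"))

val merged = hits.unionAll(misses)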


I'm seeing some really large execution plans as you might imagine in the Spark 
Ui, and the processing time seems way out of proportion with the size of the 
dataset.  (~250GB in 9 hours).


Is this the best approach to implement an algorithm like this?  Note also that 
some fields I am implementing require multiple staged split/merge steps due to 
cascading lookup joins.


Thanks,


Michael Sesterhenn

msesterh...@cars.com



Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
Yes you're just dividing by the norm of the vector you passed in. You can
look at the change on that JIRA and probably see how this was added into
the method itself.
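For Spark 1.6 that works out to something like this (an untested sketch against
the mllib Word2VecModel):

import org.apache.spark.mllib.feature.Word2VecModel

def cosineSynonyms(model: Word2VecModel, word: String, num: Int): Array[(String, Double)] = {
  val v = model.transform(word)                        // vector of the query word
  val norm = math.sqrt(v.toArray.map(x => x * x).sum)  // its L2 norm
  model.findSynonyms(word, num).map { case (w, sim) => (w, sim / norm) }
}

// e.g. cosineSynonyms(model, "science", 10)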

On Thu, Dec 29, 2016 at 10:34 PM Manish Tripathi 
wrote:

> ok got that. I understand that the ordering won't change. I just wanted to
> make sure I am getting the right thing or I understand what I am getting
> since it didn't make sense going by the cosine calculation.
>
> One last confirmation and I appreciate all the time you are spending to
> reply:
>
> In the github link issue , it's mentioned fVector is not normalised. By
> fVector it's meant the word vector we want to find synonyms for?. So in my
> example, it would be vector for 'science' word which I passed to the
> method?.
>
> if yes, then I guess, solution should be simple. Just divide the current
> cosine output by the norm of this vector. And this vector we can get by
> doing model.transform('science') if I am right?
>
> Lastly, I would be very happy to update to docs if it is editable for all
> the things I encounter as not mentioned or not very clear.
> ᐧ
>
> On Thu, Dec 29, 2016 at 2:28 PM, Sean Owen  wrote:
>
> Yes, the vectors are not otherwise normalized.
> You are basically getting the cosine similarity, but times the norm of the
> word vector you supplied, because it's not divided through. You could just
> divide the results yourself.
> I don't think it will be back-ported because the behavior was intended
> in 1.x, just wrongly documented, and we don't want to change the behavior
> in 1.x. The results are still correctly ordered anyway.
>
> On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi 
> wrote:
>
> Sean,
>
> Thanks for answer. I am using Spark 1.6 so are you saying the output I am
> getting is cos(A,B)=dot(A,B)/norm(A) ?
>
> My point with respect to normalization was that if you normalise or don't
> normalize both vectors A,B, the output would be same. Since if I normalize
> A and B, then
>
> Cos(A,B)= dot(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If
> we don't normalize it would have a norm in the denominator so output is
> same.
>
> But I understand you are saying in Spark 1.x, one vector was not
> normalized. If that is the case then it makes sense.
>
> Any idea how to fix this (get the right cosine similarity) in Spark 1.x? .
> If the input word in findSynonyms is not normalized while calculating
> cosine, then doing w2vmodel.transform(input_word) to get a vector
> representation and then dividing the current result by the norm of this
> vector should be correct?
>
> Also, I am very open to editing the docs on things I find not properly
> documented or wrong, but I need to know if that is allowed (is it like a
> Wiki)?.
> ᐧ
>
> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen  wrote:
>
> It should be the cosine similarity, yes. I think this is what was fixed in
> https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was
> really just outputting the 'unnormalized' similarity (dot / norm(a) only)
> but the docs said cosine similarity. Now it's cosine similarity in Spark 2.
> The normalization most certainly matters here, and it's the opposite:
> dividing the dot by vec norms gives you the cosine.
>
> Although docs can always be better (and here was a case where it was
> wrong) all of this comes with javadoc and examples. Right now at least,
> .transform() describes the operation as you do, so it is documented. I'd
> propose you invest in improving the docs rather than saying 'this isn't
> what I expected'.
>
> (No, our book isn't a reference for MLlib, more like worked examples)
>
> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi 
> wrote:
>
> I used a word2vec algorithm of spark to compute documents vector of a text.
>
> I then used the findSynonyms function of the model object to get synonyms
> of few words.
>
> I see something like this:
>
>
> ​
>
> I do not understand why the cosine similarity is being calculated as more
> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
> (taking negative angles).
>
> Why it is more than 1 here? What's going wrong here?.
>
> Please note, normalization of the vectors should not be changing the
> cosine similarity values since the formula remains the same. If you
> normalise it's just a dot product then, if you don't it's dot product/
> (normA)*(normB).
>
> I am facing lot of issues with respect to understanding or interpreting
> the output of Spark's ml algos. The documentation is not very clear and
> there is hardly anything mentioned with respect to how and what is being
> returned.
>
> For ex. word2vec algorithm is to convert word to vector form. So I would
> expect .transform method would give me vector of each word in the text.
>
> However .transform basically returns doc2vec (averages all word vectors of
> a text). This is confusing since nothing of this is mentioned in the 

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
ok got that. I understand that the ordering won't change. I just wanted to
make sure I am getting the right thing or I understand what I am getting
since it didn't make sense going by the cosine calculation.

One last confirmation and I appreciate all the time you are spending to
reply:

In the github link issue , it's mentioned fVector is not normalised. By
fVector it's meant the word vector we want to find synonyms for?. So in my
example, it would be vector for 'science' word which I passed to the
method?.

if yes, then I guess, solution should be simple. Just divide the current
cosine output by the norm of this vector. And this vector we can get by
doing model.transform('science') if I am right?

Lastly, I would be very happy to update to docs if it is editable for all
the things I encounter as not mentioned or not very clear.
ᐧ

On Thu, Dec 29, 2016 at 2:28 PM, Sean Owen  wrote:

> Yes, the vectors are not otherwise normalized.
> You are basically getting the cosine similarity, but times the norm of the
> word vector you supplied, because it's not divided through. You could just
> divide the results yourself.
> I don't think it will be back-ported because the behavior was intended
> in 1.x, just wrongly documented, and we don't want to change the behavior
> in 1.x. The results are still correctly ordered anyway.
>
> On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi 
> wrote:
>
>> Sean,
>>
>> Thanks for answer. I am using Spark 1.6 so are you saying the output I am
>> getting is cos(A,B)=dot(A,B)/norm(A) ?
>>
>> My point with respect to normalization was that if you normalise or don't
>> normalize both vectors A,B, the output would be same. Since if I normalize
>> A and B, then
>>
>> Cos(A,B)= dot(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If
>> we don't normalize it would have a norm in the denominator so output is
>> same.
>>
>> But I understand you are saying in Spark 1.x, one vector was not
>> normalized. If that is the case then it makes sense.
>>
>> Any idea how to fix this (get the right cosine similarity) in Spark 1.x?
>> . If the input word in findSynonyms is not normalized while calculating
>> cosine, then doing w2vmodel.transform(input_word) to get a vector
>> representation and then dividing the current result by the norm of this
>> vector should be correct?
>>
>> Also, I am very open to editing the docs on things I find not properly
>> documented or wrong, but I need to know if that is allowed (is it like a
>> Wiki)?.
>> ᐧ
>>
>> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen  wrote:
>>
>> It should be the cosine similarity, yes. I think this is what was fixed
>> in https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was
>> really just outputting the 'unnormalized' similarity (dot / norm(a) only)
>> but the docs said cosine similarity. Now it's cosine similarity in Spark 2.
>> The normalization most certainly matters here, and it's the opposite:
>> dividing the dot by vec norms gives you the cosine.
>>
>> Although docs can always be better (and here was a case where it was
>> wrong) all of this comes with javadoc and examples. Right now at least,
>> .transform() describes the operation as you do, so it is documented. I'd
>> propose you invest in improving the docs rather than saying 'this isn't
>> what I expected'.
>>
>> (No, our book isn't a reference for MLlib, more like worked examples)
>>
>> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi 
>> wrote:
>>
>> I used a word2vec algorithm of spark to compute documents vector of a
>> text.
>>
>> I then used the findSynonyms function of the model object to get
>> synonyms of few words.
>>
>> I see something like this:
>>
>>
>> ​
>>
>> I do not understand why the cosine similarity is being calculated as more
>> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
>> (taking negative angles).
>>
>> Why it is more than 1 here? What's going wrong here?.
>>
>> Please note, normalization of the vectors should not be changing the
>> cosine similarity values since the formula remains the same. If you
>> normalise it's just a dot product then, if you don't it's dot product/
>> (normA)*(normB).
>>
>> I am facing lot of issues with respect to understanding or interpreting
>> the output of Spark's ml algos. The documentation is not very clear and
>> there is hardly anything mentioned with respect to how and what is being
>> returned.
>>
>> For ex. word2vec algorithm is to convert word to vector form. So I would
>> expect .transform method would give me vector of each word in the text.
>>
>> However .transform basically returns doc2vec (averages all word vectors
>> of a text). This is confusing since nothing of this is mentioned in the
>> docs and I keep thinking why I have only one word vector instead of word
>> vectors for all words.
>>
>> I do understand by returning doc2vec it is helpful since now one doesn't
>> have to average out 

Re: Invert large matrix

2016-12-29 Thread Yanwei Wayne Zhang
Thanks for the advice. I figured out a way to solve this problem by avoiding 
the matrix representation.


Wayne



From: Sean Owen 
Sent: Thursday, December 29, 2016 1:52 PM
To: Yanwei Wayne Zhang; user
Subject: Re: Invert large matrix

I think the best advice is: don't do that. If you're trying to solve a linear 
system, solve the linear system without explicitly constructing a matrix 
inverse. Is that what you mean?

On Thu, Dec 29, 2016 at 2:22 AM Yanwei Wayne Zhang 
> wrote:

Hi all,


I have a matrix X stored as RDD[SparseVector] that is high dimensional, say 800 
million rows and 2 million columns, and more than 95% of the entries are zero.

Is there a way to invert (X'X + eye) efficiently, where X' is the transpose of 
X and eye is the identity matrix? I am thinking of using RowMatrix but not sure 
if it is feasible.

Any suggestion is highly appreciated.


Thanks.


Wayne



Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
Yes, the vectors are not otherwise normalized.
You are basically getting the cosine similarity, but times the norm of the
word vector you supplied, because it's not divided through. You could just
divide the results yourself.
I don't think it will be back-ported because the behavior was intended
in 1.x, just wrongly documented, and we don't want to change the behavior
in 1.x. The results are still correctly ordered anyway.

On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi 
wrote:

> Sean,
>
> Thanks for answer. I am using Spark 1.6 so are you saying the output I am
> getting is cos(A,B)=dot(A,B)/norm(A) ?
>
> My point with respect to normalization was that if you normalise or don't
> normalize both vectors A,B, the output would be same. Since if I normalize
> A and B, then
>
> Cos(A,B)= dot(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If
> we don't normalize it would have a norm in the denominator so output is
> same.
>
> But I understand you are saying in Spark 1.x, one vector was not
> normalized. If that is the case then it makes sense.
>
> Any idea how to fix this (get the right cosine similarity) in Spark 1.x? .
> If the input word in findSynonyms is not normalized while calculating
> cosine, then doing w2vmodel.transform(input_word) to get a vector
> representation and then dividing the current result by the norm of this
> vector should be correct?
>
> Also, I am very open to editing the docs on things I find not properly
> documented or wrong, but I need to know if that is allowed (is it like a
> Wiki)?.
> ᐧ
>
> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen  wrote:
>
> It should be the cosine similarity, yes. I think this is what was fixed in
> https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was
> really just outputting the 'unnormalized' similarity (dot / norm(a) only)
> but the docs said cosine similarity. Now it's cosine similarity in Spark 2.
> The normalization most certainly matters here, and it's the opposite:
> dividing the dot by vec norms gives you the cosine.
>
> Although docs can always be better (and here was a case where it was
> wrong) all of this comes with javadoc and examples. Right now at least,
> .transform() describes the operation as you do, so it is documented. I'd
> propose you invest in improving the docs rather than saying 'this isn't
> what I expected'.
>
> (No, our book isn't a reference for MLlib, more like worked examples)
>
> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi 
> wrote:
>
> I used a word2vec algorithm of spark to compute documents vector of a text.
>
> I then used the findSynonyms function of the model object to get synonyms
> of few words.
>
> I see something like this:
>
>
> ​
>
> I do not understand why the cosine similarity is being calculated as more
> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
> (taking negative angles).
>
> Why it is more than 1 here? What's going wrong here?.
>
> Please note, normalization of the vectors should not be changing the
> cosine similarity values since the formula remains the same. If you
> normalise it's just a dot product then, if you don't it's dot product/
> (normA)*(normB).
>
> I am facing lot of issues with respect to understanding or interpreting
> the output of Spark's ml algos. The documentation is not very clear and
> there is hardly anything mentioned with respect to how and what is being
> returned.
>
> For ex. word2vec algorithm is to convert word to vector form. So I would
> expect .transform method would give me vector of each word in the text.
>
> However .transform basically returns doc2vec (averages all word vectors of
> a text). This is confusing since nothing of this is mentioned in the docs
> and I keep thinking why I have only one word vector instead of word vectors
> for all words.
>
> I do understand by returning doc2vec it is helpful since now one doesn't
> have to average out each word vector for the whole text. But the docs don't
> help or explicitly say that.
>
> This ends up wasting lot of time in just figuring out what is being
> returned from an algorithm from Spark.
>
> Does someone have a better solution for this?
>
> I have read the Spark book. That is not about Mllib. I am not sure if
> Sean's book would cover all the documentation aspect better than what we
> have currently on the docs page.
>
> Thanks
>
>
>
> ᐧ
>
>
>


Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
Sean,

Thanks for answer. I am using Spark 1.6 so are you saying the output I am
getting is cos(A,B)=dot(A,B)/norm(A) ?

My point with respect to normalization was that if you normalise or don't
normalize both vectors A,B, the output would be same. Since if I normalize
A and B, then

Cos(A,B)= dot(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If we
don't normalize it would have a norm in the denominator so output is same.

But I understand you are saying in Spark 1.x, one vector was not
normalized. If that is the case then it makes sense.

Any idea how to fix this (get the right cosine similarity) in Spark 1.x? .
If the input word in findSynonyms is not normalized while calculating
cosine, then doing w2vmodel.transform(input_word) to get a vector
representation and then dividing the current result by the norm of this
vector should be correct?

Also, I am very open to editing the docs on things I find not properly
documented or wrong, but I need to know if that is allowed (is it like a
Wiki)?.
ᐧ

On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen  wrote:

> It should be the cosine similarity, yes. I think this is what was fixed in
> https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was
> really just outputting the 'unnormalized' similarity (dot / norm(a) only)
> but the docs said cosine similarity. Now it's cosine similarity in Spark 2.
> The normalization most certainly matters here, and it's the opposite:
> dividing the dot by vec norms gives you the cosine.
>
> Although docs can always be better (and here was a case where it was
> wrong) all of this comes with javadoc and examples. Right now at least,
> .transform() describes the operation as you do, so it is documented. I'd
> propose you invest in improving the docs rather than saying 'this isn't
> what I expected'.
>
> (No, our book isn't a reference for MLlib, more like worked examples)
>
> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi 
> wrote:
>
>> I used a word2vec algorithm of spark to compute documents vector of a
>> text.
>>
>> I then used the findSynonyms function of the model object to get
>> synonyms of few words.
>>
>> I see something like this:
>>
>>
>> ​
>>
>> I do not understand why the cosine similarity is being calculated as more
>> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
>> (taking negative angles).
>>
>> Why it is more than 1 here? What's going wrong here?.
>>
>> Please note, normalization of the vectors should not be changing the
>> cosine similarity values since the formula remains the same. If you
>> normalise it's just a dot product then, if you don't it's dot product/
>> (normA)*(normB).
>>
>> I am facing lot of issues with respect to understanding or interpreting
>> the output of Spark's ml algos. The documentation is not very clear and
>> there is hardly anything mentioned with respect to how and what is being
>> returned.
>>
>> For ex. word2vec algorithm is to convert word to vector form. So I would
>> expect .transform method would give me vector of each word in the text.
>>
>> However .transform basically returns doc2vec (averages all word vectors
>> of a text). This is confusing since nothing of this is mentioned in the
>> docs and I keep thinking why I have only one word vector instead of word
>> vectors for all words.
>>
>> I do understand by returning doc2vec it is helpful since now one doesn't
>> have to average out each word vector for the whole text. But the docs don't
>> help or explicitly say that.
>>
>> This ends up wasting lot of time in just figuring out what is being
>> returned from an algorithm from Spark.
>>
>> Does someone have a better solution for this?
>>
>> I have read the Spark book. That is not about Mllib. I am not sure if
>> Sean's book would cover all the documentation aspect better than what we
>> have currently on the docs page.
>>
>> Thanks
>>
>>
>>
>> ᐧ
>>
>


Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
It should be the cosine similarity, yes. I think this is what was fixed in
https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was really
just outputting the 'unnormalized' similarity (dot / norm(a) only) but the
docs said cosine similarity. Now it's cosine similarity in Spark 2. The
normalization most certainly matters here, and it's the opposite: dividing
the dot by vec norms gives you the cosine.

Although docs can always be better (and here was a case where it was wrong)
all of this comes with javadoc and examples. Right now at least,
.transform() describes the operation as you do, so it is documented. I'd
propose you invest in improving the docs rather than saying 'this isn't
what I expected'.

(No, our book isn't a reference for MLlib, more like worked examples)

On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi  wrote:

> I used a word2vec algorithm of spark to compute documents vector of a text.
>
> I then used the findSynonyms function of the model object to get synonyms
> of few words.
>
> I see something like this:
>
>
> ​
>
> I do not understand why the cosine similarity is being calculated as more
> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
> (taking negative angles).
>
> Why it is more than 1 here? What's going wrong here?.
>
> Please note, normalization of the vectors should not be changing the
> cosine similarity values since the formula remains the same. If you
> normalise it's just a dot product then, if you don't it's dot product/
> (normA)*(normB).
>
> I am facing lot of issues with respect to understanding or interpreting
> the output of Spark's ml algos. The documentation is not very clear and
> there is hardly anything mentioned with respect to how and what is being
> returned.
>
> For ex. word2vec algorithm is to convert word to vector form. So I would
> expect .transform method would give me vector of each word in the text.
>
> However .transform basically returns doc2vec (averages all word vectors of
> a text). This is confusing since nothing of this is mentioned in the docs
> and I keep thinking why I have only one word vector instead of word vectors
> for all words.
>
> I do understand by returning doc2vec it is helpful since now one doesn't
> have to average out each word vector for the whole text. But the docs don't
> help or explicitly say that.
>
> This ends up wasting lot of time in just figuring out what is being
> returned from an algorithm from Spark.
>
> Does someone have a better solution for this?
>
> I have read the Spark book. That is not about Mllib. I am not sure if
> Sean's book would cover all the documentation aspect better than what we
> have currently on the docs page.
>
> Thanks
>
>
>
> ᐧ
>


Re: Invert large matrix

2016-12-29 Thread Sean Owen
I think the best advice is: don't do that. If you're trying to solve a
linear system, solve the linear system without explicitly constructing a
matrix inverse. Is that what you mean?

On Thu, Dec 29, 2016 at 2:22 AM Yanwei Wayne Zhang <
actuary_zh...@hotmail.com> wrote:

> Hi all,
>
>
> I have a matrix X stored as RDD[SparseVector] that is high dimensional,
> say 800 million rows and 2 million columns, and more than 95% of the entries are
> zero.
>
> Is there a way to invert (X'X + eye) efficiently, where X' is the
> transpose of X and eye is the identity matrix? I am thinking of using
> RowMatrix but not sure if it is feasible.
>
> Any suggestion is highly appreciated.
>
>
> Thanks.
>
>
> Wayne
>
>
>


Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
I used a word2vec algorithm of spark to compute documents vector of a text.

I then used the findSynonyms function of the model object to get synonyms
of few words.

I see something like this:


​

I do not understand why the cosine similarity is being calculated as more
than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
(taking negative angles).

Why it is more than 1 here? What's going wrong here?.

Please note, normalization of the vectors should not be changing the cosine
similarity values since the formula remains the same. If you normalise it's
just a dot product then, if you don't it's dot product/ (normA)*(normB).

I am facing lot of issues with respect to understanding or interpreting the
output of Spark's ml algos. The documentation is not very clear and there
is hardly anything mentioned with respect to how and what is being
returned.

For ex. word2vec algorithm is to convert word to vector form. So I would
expect .transform method would give me vector of each word in the text.

However .transform basically returns doc2vec (averages all word vectors of
a text). This is confusing since nothing of this is mentioned in the docs
and I keep thinking why I have only one word vector instead of word vectors
for all words.

I do understand by returning doc2vec it is helpful since now one doesn't
have to average out each word vector for the whole text. But the docs don't
help or explicitly say that.

This ends up wasting lot of time in just figuring out what is being
returned from an algorithm from Spark.

Does someone have a better solution for this?

I have read the Spark book. That is not about Mllib. I am not sure if
Sean's book would cover all the documentation aspect better than what we
have currently on the docs page.

Thanks



ᐧ


Reading specific column family and columns in Hbase table through spark

2016-12-29 Thread Mich Talebzadeh
Hi,

I have a routine in Spark that iterates through HBase rows and tries to
read columns.

My question is how can I read the correct ordering of columns?

example

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

val parsed = hBaseRDD.map{ case(b, a) => val iter = a.list().iterator();
( Bytes.toString(a.getRow()).toString,
Bytes.toString( iter.next().getValue()).toString,
Bytes.toString( iter.next().getValue()).toString,
Bytes.toString( iter.next().getValue()).toString,
Bytes.toString(iter.next().getValue())
)}

The above reads the column family columns sequentially. How can I force it
to read specific columns only?
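For example, would addressing the columns explicitly by family and qualifier,
rather than relying on the iterator order, be the way to go (untested; "cf",
"col1", "col2" are placeholder names)?

import org.apache.hadoop.hbase.util.Bytes

val parsed = hBaseRDD.map { case (_, result) =>
  // getValue returns null if the cell is absent
  ( Bytes.toString(result.getRow),
    Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"))),
    Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col2"))) )
}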


Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Marco Mistroni
Hello
 No, sorry, I don't have any further insight into that. I have seen similar
errors but for completely different issues, and in most of the cases it had
to do with my data or my processing rather than Spark itself.
What does your program do? You say it runs for 2-3 hours, what is the
logic? Just processing a huge amount of data?
Doing ML?
I'd say break down your application... reduce functionality, run and see the
outcome, then add more functionality, run and see again.
I found myself doing these kinds of things when I got errors in my Spark
apps.

To get concrete help you will have to trim down the code to a few lines
that can reproduce the error. That will be a great start

Sorry for not being of much help

hth
 marco





On Thu, Dec 29, 2016 at 12:00 PM, Palash Gupta 
wrote:

> Hi Marco,
>
> Thanks for your response.
>
> Yes I tested it before & am able to load from the linux filesystem, and it
> also sometimes has a similar issue.
>
> However, in both cases (either from hadoop or the linux file system), this
> error comes in some specific scenarios as per my observations:
>
> 1. When two separate parallel Spark applications are initiated from one
> driver (not all the time, sometimes)
> 2. If one Spark job runs for longer than expected, let's say 2-3
> hours, the second application terminates giving the error.
>
> To debug the problem, it would be good if you can share some possible
> reasons why the failed-to-get-broadcast error may come.
>
> Or if you need more logs I can share.
>
> Thanks again Spark User Group.
>
> Best Regards
> Palash Gupta
>
>
>
> Sent from Yahoo Mail on Android
> 
>
> On Thu, 29 Dec, 2016 at 2:57 pm, Marco Mistroni
>  wrote:
> Hi
>  Pls try to read a CSV from filesystem instead of hadoop. If you can read
> it successfully then your hadoop file is the issue and you can start
> debugging from there.
> Hth
>
> On 29 Dec 2016 6:26 am, "Palash Gupta" 
> wrote:
>
>> Hi Apache Spark User team,
>>
>>
>>
>> Greetings!
>>
>> I started developing an application using Apache Hadoop and Spark using
>> python. My pyspark application randomly terminated saying "Failed to get
>> broadcast_1*" and I have been searching for suggestion and support in
>> Stakeoverflow at Failed to get broadcast_1_piece0 of broadcast_1 in
>> pyspark application
>> 
>>
>>
>> Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application
>> I was building an application on Apache Spark 2.00 with Python 3.4 and
>> trying to load some CSV files from HDFS (...
>>
>> 
>>
>>
>> Could you please provide suggestion registering myself in Apache User
>> list or how can I get suggestion or support to debug the problem I am
>> facing?
>>
>> Your response will be highly appreciated.
>>
>>
>>
>> Thanks & Best Regards,
>> Engr. Palash Gupta
>> WhatsApp/Viber: +8801817181502
>> Skype: palash2494
>>
>>
>>


Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hello Jacek,

Actually, Reynold is still the release manager and I am just sending this
message for him :) Sorry. I should have made it clear in my original email.

Thanks,

Yin

On Thu, Dec 29, 2016 at 10:58 AM, Jacek Laskowski  wrote:

> Hi Yin,
>
> I've been surprised the first time when I noticed rxin stepped back and a
> new release manager stepped in. Congrats on your first ANNOUNCE!
>
> I can only expect even more great stuff coming in to Spark from the dev
> team after Reynold spared some time 
>
> Can't wait to read the changes...
>
> Jacek
>
> On 29 Dec 2016 5:03 p.m., "Yin Huai"  wrote:
>
>> Hi all,
>>
>> Apache Spark 2.1.0 is the second release of Spark 2.x line. This release
>> makes significant strides in the production readiness of Structured
>> Streaming, with added support for event time watermarks
>> 
>> and Kafka 0.10 support
>> .
>> In addition, this release focuses more on usability, stability, and polish,
>> resolving over 1200 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 2.1.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes: https://spark.apache.or
>> g/releases/spark-release-2-1-0.html
>>
>> (note: If you see any issues with the release notes, webpage or published
>> artifacts, please contact me directly off-list)
>>
>>
>>


Re: Issue with SparkR setup on RStudio

2016-12-29 Thread Felix Cheung
Any reason you are setting HADOOP_HOME?

From the error it seems you are running into an issue with the Hive config, likely
with trying to load hive-site.xml. Could you try not setting HADOOP_HOME?



From: Md. Rezaul Karim 
Sent: Thursday, December 29, 2016 10:24:57 AM
To: spark users
Subject: Issue with SparkR setup on RStudio

Dear Spark users,
I am trying to set up SparkR on RStudio to perform some basic data manipulations
and ML modeling. However, I am getting a strange error while creating the SparkR
session or a DataFrame that says: java.lang.IllegalArgumentException: Error while
instantiating 'org.apache.spark.sql.hive.HiveSessionState'.
According to the Spark documentation at
http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession, I
don't need to configure the Hive path or related variables.
I have the following source code:

SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]", 
sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/", spark.driver.memory = 
"8g"), enableHiveSupport = TRUE)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
print(df)
head(df)
sparkR.session.stop()
Please note that the HADOOP_HOME contains the 'winutils.exe' file. The details
of the error are as follows:

Error in handleErrors(returnStatus, conn) :  
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':

   at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

   at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

   at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)

   at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

   at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

   at scala.collection.Traversabl



 Any kind of help would be appreciated.


Regards,
_
Md. Rezaul Karim BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Nicholas Hakobian
If you are using spark 2.0 (as listed in the stackoverflow post) why are
you using the external CSV module from Databricks? Spark 2.0 includes the
functionality from this external module natively, and it's possible you are
mixing an older library with a newer Spark, which could explain a crash.
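For example (a minimal sketch, assuming Spark 2.x and a placeholder HDFS path),
the built-in reader replaces the external package:

    # Spark 2.x: CSV support is built into DataFrameReader; no spark-csv package needed.
    df = spark.read.csv("hdfs:///path/to/file.csv", header=True, inferSchema=True)

    # Spark 1.x style using the external Databricks package, for comparison:
    # df = (sqlContext.read.format("com.databricks.spark.csv")
    #       .option("header", "true")
    #       .load("hdfs:///path/to/file.csv"))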

Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com



On Thu, Dec 29, 2016 at 4:00 AM, Palash Gupta <
spline_pal...@yahoo.com.invalid> wrote:

> Hi Marco,
>
> Thanks for your response.
>
> Yes, I tested it before and am able to load from the Linux filesystem, and it
> also sometimes has a similar issue.
>
> However in both cases (either from hadoop or linux file system), this
> error comes in some specific scenario as per my observations:
>
> 1. When two separate Spark applications are initiated in parallel from one
> driver (not all the time, only sometimes)
> 2. If one Spark job runs for longer than expected, let's say 2-3
> hours, the second application terminates giving the error.
>
> To debug the problem, it would be helpful if you could share some possible
> reasons why the "failed to get broadcast" error may occur.
>
> Or if you need more logs I can share.
>
> Thanks again Spark User Group.
>
> Best Regards
> Palash Gupta
>
>
>
> Sent from Yahoo Mail on Android
> 
>
> On Thu, 29 Dec, 2016 at 2:57 pm, Marco Mistroni
>  wrote:
> Hi
>  Pls try to read a CSV from filesystem instead of hadoop. If you can read
> it successfully then your hadoop file is the issue and you can start
> debugging from there.
> Hth
>
> On 29 Dec 2016 6:26 am, "Palash Gupta" 
> wrote:
>
>> Hi Apache Spark User team,
>>
>>
>>
>> Greetings!
>>
>> I started developing an application using Apache Hadoop and Spark using
>> python. My pyspark application randomly terminated saying "Failed to get
>> broadcast_1*" and I have been searching for suggestion and support in
>> Stakeoverflow at Failed to get broadcast_1_piece0 of broadcast_1 in
>> pyspark application
>> 
>>
>>
>> Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application
>> I was building an application on Apache Spark 2.00 with Python 3.4 and
>> trying to load some CSV files from HDFS (...
>>
>> 
>>
>>
>> Could you please advise how to register myself in the Apache user
>> list, and how I can get suggestions or support to debug the problem I am
>> facing?
>>
>> Your response will be highly appreciated.
>>
>>
>>
>> Thanks & Best Regards,
>> Engr. Palash Gupta
>> WhatsApp/Viber: +8801817181502
>> Skype: palash2494
>>
>>
>>


Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Jacek Laskowski
Hi Yan,

I've been surprised the first time when I noticed rxin stepped back and a
new release manager stepped in. Congrats on your first ANNOUNCE!

I can only expect even more great stuff coming into Spark from the dev
team after Reynold spared some time.

Can't wait to read the changes...

Jacek

On 29 Dec 2016 5:03 p.m., "Yin Huai"  wrote:

> Hi all,
>
> Apache Spark 2.1.0 is the second release of Spark 2.x line. This release
> makes significant strides in the production readiness of Structured
> Streaming, with added support for event time watermarks
> 
> and Kafka 0.10 support
> .
> In addition, this release focuses more on usability, stability, and polish,
> resolving over 1200 tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 2.1.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes: https://spark.apache.org/releases/spark-release-2-1-0.html
>
> (note: If you see any issues with the release notes, webpage or published
> artifacts, please contact me directly off-list)
>
>
>


Re: [PySpark - 1.6] - Avoid object serialization

2016-12-29 Thread Holden Karau
Alternatively, using the broadcast functionality can also help with this.
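For instance (a rough sketch, assuming the ShortTextAnalyzer object itself can be
pickled), you can build the analyzer once on the driver, broadcast it, and reference
the broadcast value inside the closure:

    # sc is the SparkContext (e.g. ssc.sparkContext); ShortTextAnalyzer and
    # root_dir are taken from the snippet quoted below. The broadcast value is
    # loaded once per worker process instead of being shipped inside every
    # task closure.
    analyzer_bc = sc.broadcast(ShortTextAnalyzer(root_dir))

    def processRDD(rdd):
        rdd.foreach(lambda record:
                    analyzer_bc.value.analyze_short_text_event(record[1]))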

On Thu, Dec 29, 2016 at 3:05 AM Eike von Seggern 
wrote:

> 2016-12-28 20:17 GMT+01:00 Chawla,Sumit :
>
> Would this work for you?
>
> def processRDD(rdd):
>     analyzer = ShortTextAnalyzer(root_dir)
>     rdd.foreach(lambda record:
>                 analyzer.analyze_short_text_event(record[1]))
>
> ssc.union(*streams).filter(lambda x: x[1] != None) \
>     .foreachRDD(lambda rdd: processRDD(rdd))
>
>
> I think this will still send each analyzer to all executors where the RDD
> partitions are stored.
>
> Maybe you can work around this with `RDD.foreachPartition()`:
>
> def processRDD(rdd):
>     def partition_func(records):
>         analyzer = ShortTextAnalyzer(root_dir)
>         for record in records:
>             analyzer.analyze_short_text_event(record[1])
>     rdd.foreachPartition(partition_func)
>
> This will create one analyzer per partition and RDD.
>
> Best
>
> Eike
>


Issue with SparkR setup on RStudio

2016-12-29 Thread Md. Rezaul Karim
Dear Spark users,

I am trying to set up SparkR on RStudio to perform some basic data
manipulations and ML modeling. However, I am getting a strange error while
creating a SparkR session or DataFrame that says:
java.lang.IllegalArgumentException:
Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'.

According to Spark documentation at
http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession, I
don’t need to configure Hive path or related variables.

I have the following source code:

SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R",
"lib")))
sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]",
sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/", spark.driver.memory =
"8g"), enableHiveSupport = TRUE)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
print(df)
head(df)
sparkR.session.stop()

Please note that the HADOOP_HOME contains the 'winutils.exe' file. The
details of the error are as follows:

Error in handleErrors(returnStatus, conn) :
java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveSessionState':

   at
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

   at
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

   at
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

   at
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)

   at
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)

   at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

   at
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

   at scala.collection.Traversabl


 Any kind of help would be appreciated.




Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-29 Thread Yin Huai
Hello,

We spent some time preparing artifacts and changes to the website (including
the release notes). I just sent out the announcement. 2.1.0 is
officially released.

Thanks,

Yin

On Wed, Dec 28, 2016 at 12:42 PM, Justin Miller <
justin.mil...@protectwise.com> wrote:

> Interesting, because a bug that seemed to be fixed in 2.1.0-SNAPSHOT
> doesn't appear to be fixed in 2.1.0 stable (it centered around a
> null-pointer exception during code gen). It seems to be fixed in
> 2.1.1-SNAPSHOT, but I can try stable again.
>
> On Dec 28, 2016, at 1:38 PM, Mark Hamstra  wrote:
>
> A SNAPSHOT build is not a stable artifact, but rather floats to the top of
> commits that are intended for the next release.  So, 2.1.1-SNAPSHOT comes
> after the 2.1.0 release and contains any code at the time that the artifact
> was built that was committed to the branch-2.1 maintenance branch and is,
> therefore, intended for the eventual 2.1.1 maintenance release.  Once a
> release is tagged and stable artifacts for it can be built, there is no
> purpose for a SNAPSHOT of that release -- e.g. there is no longer any
> purpose for a 2.1.0-SNAPSHOT release; if you want 2.1.0, then you should be
> using stable artifacts now, not SNAPSHOTs.
>
> The existence of a SNAPSHOT doesn't imply anything about the release date
> of the associated finished version.  Rather, it only indicates a name that
> is attached to all of the code that is currently intended for the
> associated release number.
>
> On Wed, Dec 28, 2016 at 3:09 PM, Justin Miller <
> justin.mil...@protectwise.com> wrote:
>
>> It looks like the jars for 2.1.0-SNAPSHOT are gone?
>>
>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.1.0-SNAPSHOT/
>>
>> Also:
>>
>> 2.1.0-SNAPSHOT/    Fri Dec 23 16:31:42 UTC 2016
>> 2.1.1-SNAPSHOT/    Wed Dec 28 20:01:10 UTC 2016
>> 2.2.0-SNAPSHOT/    Wed Dec 28 19:12:38 UTC 2016
>>
>> What's with 2.1.1-SNAPSHOT? Is that version about to be released as well?
>>
>> Thanks!
>> Justin
>>
>> On Dec 28, 2016, at 12:53 PM, Mark Hamstra 
>> wrote:
>>
>> The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0
>>
>> On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers  wrote:
>>
>>> seems like the artifacts are on maven central but the website is not yet
>>> updated.
>>>
>>> strangely the tag v2.1.0 is not yet available on github. I assume it's
>>> equal to v2.1.0-rc5
>>>
>>> On Fri, Dec 23, 2016 at 10:52 AM, Justin Miller <
>>> justin.mil...@protectwise.com> wrote:
>>>
 I'm curious about this as well. Seems like the vote passed.

 > On Dec 23, 2016, at 2:00 AM, Aseem Bansal 
 wrote:
 >
 >


 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>>
>
>


[ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hi all,

Apache Spark 2.1.0 is the second release of Spark 2.x line. This release
makes significant strides in the production readiness of Structured
Streaming, with added support for event time watermarks

and Kafka 0.10 support
.
In addition, this release focuses more on usability, stability, and polish,
resolving over 1200 tickets.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 2.1.0, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-1-0.html

(note: If you see any issues with the release notes, webpage or published
artifacts, please contact me directly off-list)


Re: [PySpark - 1.6] - Avoid object serialization

2016-12-29 Thread Eike von Seggern
2016-12-28 20:17 GMT+01:00 Chawla,Sumit :

> Would this work for you?
>
> def processRDD(rdd):
>     analyzer = ShortTextAnalyzer(root_dir)
>     rdd.foreach(lambda record:
>                 analyzer.analyze_short_text_event(record[1]))
>
> ssc.union(*streams).filter(lambda x: x[1] != None) \
>     .foreachRDD(lambda rdd: processRDD(rdd))
>

I think this will still send each analyzer to all executors where the RDD
partitions are stored.

Maybe you can work around this with `RDD.foreachPartition()`:

def processRDD(rdd):
    def partition_func(records):
        analyzer = ShortTextAnalyzer(root_dir)
        for record in records:
            analyzer.analyze_short_text_event(record[1])
    rdd.foreachPartition(partition_func)

This will create one analyzer per partition and RDD.

Best

Eike


send this email to unsubscribe

2016-12-29 Thread John



[Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Yuanzhe Yang (杨远哲)
Hi all,

Thanks a lot for your contributions to bring us new technologies.

I don't want to waste your time, so before I write to you, I googled, checked 
stackoverflow and mailing list archive with keywords "streaming" and "jdbc". 
But I was not able to get any solution to my use case. I hope I can get some 
clarification from you.

The use case is quite straightforward, I need to harvest a relational database 
via jdbc, do something with data, and store result into Kafka. I am stuck at 
the first step, and the difficulty is as follows:

1. The database is too large to ingest with one thread.
2. The database is dynamic and time series data comes in constantly.

Then an ideal workflow is that multiple workers process partitions of data 
incrementally according to a time window. For example, the processing starts 
from the earliest data with each batch containing data for one hour. If data 
ingestion speed is faster than data production speed, then eventually the 
entire database will be harvested and those workers will start to "tail" the 
database for new data streams and the processing becomes real time.

With Spark SQL I can ingest data from a JDBC source with partitions divided by 
time windows, but how can I dynamically increment the time windows during 
execution? Assume that there are two workers ingesting data of 2017-01-01 and 
2017-01-02, the one which finishes quicker gets next task for 2017-01-03. But I 
am not able to find out how to increment those values during execution.
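For reference, this is roughly what I mean by partitions divided by time windows
(a sketch with made-up table, column and connection details):

    # One JDBC partition per time window; each predicate becomes a WHERE clause.
    predicates = [
        "created_at >= '2017-01-01 00:00:00' AND created_at < '2017-01-01 01:00:00'",
        "created_at >= '2017-01-01 01:00:00' AND created_at < '2017-01-01 02:00:00'",
    ]
    df = spark.read.jdbc(
        url="jdbc:mysql://db-host:3306/mydb",
        table="events",
        predicates=predicates,
        properties={"user": "reader", "password": "secret"},
    )

The list of predicates is fixed when the DataFrame is defined, which is exactly the
limitation: I cannot grow or shift those windows while the job is running.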

Then I looked into Structured Streaming. It looks much more promising because 
window operations based on event time are considered during streaming, which 
could be the solution to my use case. However, from documentation and code 
example I did not find anything related to streaming data from a growing 
database. Is there anything I can read to achieve my goal?

Any suggestion is highly appreciated. Thank you very much and have a nice day.

Best regards,
Yang
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Palash Gupta
Hi Marco,
Thanks for your response.
Yes, I tested it before and am able to load from the Linux filesystem, and it also 
sometimes has a similar issue.
However, in both cases (either from the Hadoop or the Linux file system), this error 
comes in some specific scenarios as per my observations:
1. When two separate Spark applications are initiated in parallel from one driver 
(not all the time, only sometimes)
2. If one Spark job runs for longer than expected, let's say 2-3 hours, the second 
application terminates giving the error.
To debug the problem, it would be helpful if you could share some possible 
reasons why the "failed to get broadcast" error may occur.
Or if you need more logs I can share.
Thanks again Spark User Group.
Best Regards,
Palash Gupta


Sent from Yahoo Mail on Android

On Thu, 29 Dec, 2016 at 2:57 pm, Marco Mistroni wrote:
Hi
Pls try to read a CSV from filesystem instead of hadoop. If you can read it 
successfully then your hadoop file is the issue and you can start debugging 
from there.
Hth
On 29 Dec 2016 6:26 am, "Palash Gupta"  wrote:

Hi Apache Spark User team,


Greetings!
I started developing an application using Apache Hadoop and Spark using Python. 
My pyspark application randomly terminated saying "Failed to get broadcast_1*" 
and I have been searching for suggestions and support on Stack Overflow at Failed 
to get broadcast_1_piece0 of broadcast_1 in pyspark application

  
Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application
I was building an application on Apache Spark 2.00 with Python 3.4 and trying 
to load some CSV files from HDFS (...

Could you please advise how to register myself in the Apache user list, and 
how I can get suggestions or support to debug the problem I am facing?

Your response will be highly appreciated. 


Thanks & Best Regards,
Engr. Palash Gupta
WhatsApp/Viber: +8801817181502
Skype: palash2494


Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Marco Mistroni
Hi
 Pls try to read a CSV from filesystem instead of hadoop. If you can read
it successfully then your hadoop file is the issue and you can start
debugging from there.
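For example (a rough sketch with placeholder paths):

    # If the local read succeeds but the HDFS read fails, the problem is on the
    # Hadoop side rather than in the Spark code.
    df_local = spark.read.csv("file:///tmp/sample.csv", header=True)
    df_hdfs = spark.read.csv("hdfs://namenode:8020/data/sample.csv", header=True)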
Hth

On 29 Dec 2016 6:26 am, "Palash Gupta" 
wrote:

> Hi Apache Spark User team,
>
>
>
> Greetings!
>
> I started developing an application using Apache Hadoop and Spark using
> python. My pyspark application randomly terminated saying "Failed to get
> broadcast_1*" and I have been searching for suggestion and support in
> Stakeoverflow at Failed to get broadcast_1_piece0 of broadcast_1 in
> pyspark application
> 
>
>
> Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application
> I was building an application on Apache Spark 2.00 with Python 3.4 and
> trying to load some CSV files from HDFS (...
>
> 
>
>
> Could you please advise how to register myself in the Apache user list,
> or how I can get suggestions or support to debug the problem I am facing?
>
> Your response will be highly appreciated.
>
>
>
> Thanks & Best Regards,
> Engr. Palash Gupta
> WhatsApp/Viber: +8801817181502 <+880%201817-181502>
> Skype: palash2494
>
>
>


spark program for dependent cascading operations

2016-12-29 Thread srimugunthan dhandapani
Hi all,

Can somebody solve the problem below using Spark?

There are two dataset of numbers

Set1= {4,6,11,14} and Set2= {5,11,12,3}

I have to subtract each element of Set2 from the corresponding element of Set1.
But if the element in Set2 is bigger, then the residue left over from the
subtraction is carried into the subtraction of the next element in Set1.

For example:

Set1.[0] - Set2.[0] is (4-5). [with result = 0 and the residue = 1 which is
used for next element subtraction.]

Set1.[1] - Set2.[1] will be (6 - 1(residue from last subtraction) - 11)
[result = 0 and the residue is 6]

Set1.[2] - Set2.[2] will be (11-6-12) [result = 0 and with residue = 7]

Set1.[3] - Set2.[3] will be (14-7-3) [result = 4 and with residue = 0]

giving answer sets {0,0,0,4} and {1, 6, 7, 0}

Considering two more examples:

Set1 = {12, 4, 5} and Set2 = {10, 11, 12, 13}
giving answer sets { 2, 0, 0} and {0, 7, 14, 13}

and

Set1 = {4, 6, 7,2 } and Set2 = {6, 10}
giving answer sets  { 0 ,0, 1, 2 } and {2,6}

Can Spark's programming model be used to perform dependent operations on
two Spark RDDs? If so, how can I use Spark's DataFrame or RDD API to perform
these operations?
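
For clarity, the rule I am describing is easy to write as a sequential scan in
plain Python. A sketch, assuming from the first example that each step computes
result = max(0, a - carry - b) and the new carry = max(0, carry + b - a); the
bookkeeping for sets of unequal length is left open, since my examples are
ambiguous there:

    from itertools import zip_longest

    def cascade_subtract(set1, set2):
        # Each Set1 element first pays off the carried residue, then the matching
        # Set2 element; whatever cannot be subtracted becomes the next residue.
        results, residues = [], []
        carry = 0
        for a, b in zip_longest(set1, set2, fillvalue=0):
            results.append(max(0, a - carry - b))
            carry = max(0, carry + b - a)
            residues.append(carry)
        return results, residues

    print(cascade_subtract([4, 6, 11, 14], [5, 11, 12, 3]))
    # -> ([0, 0, 0, 4], [1, 6, 7, 0])

Since every step depends on the previous residue, this looks like a fold rather
than a parallel map, which is why I am not sure how to express it with RDDs or
DataFrames.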


Re: Jdbc connection from sc.addJar failing

2016-12-29 Thread drtrotter74
Have you found any solution to this?  I am having a similar issue where my
db2jcc.jar license file is not being found.  I was hoping the addjar()
method would work, but it does not seem to help. 

I cannot even get the addjar syntax correct it seems...

Can I just call it inline?: 

sqlContext.addJar("mypath/file.jar"
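One workaround I am considering is to pass the jars at submit time or via the
configuration instead of calling addJar after the context exists, roughly like
this (a sketch; the paths and jar names are placeholders):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Both the DB2 driver jar and its license jar need to be visible to the
    # driver and the executors.
    conf = (SparkConf()
            .setAppName("db2-jdbc-example")
            .set("spark.jars", "/opt/jars/db2jcc.jar,/opt/jars/db2jcc_license_cu.jar"))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

or equivalently spark-submit --jars /opt/jars/db2jcc.jar,/opt/jars/db2jcc_license_cu.jar.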





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Jdbc-connection-from-sc-addJar-failing-tp26797p28259.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org