Re: PIG to Spark

2018-01-08 Thread Jeff Zhang
Pig supports the Spark engine now, so you can leverage Spark execution with a Pig
script.

I am afraid there's no solution to convert a Pig script to Spark API code.





Pralabh Kumar wrote on Monday, January 8, 2018 at 11:25 PM:

> Hi
>
> Is there a convenient way / open source project to convert PIG scripts to
> Spark?
>
>
> Regards
> Pralabh Kumar
>


Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Felix Cheung
And Hadoop 3.x is not part of the release and sign-off for 2.2.1.

Maybe we could update the website to avoid any confusion with "later".


From: Josh Rosen 
Sent: Monday, January 8, 2018 10:17:14 AM
To: akshay naidu
Cc: Saisai Shao; Raj Adyanthaya; spark users
Subject: Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

My current best guess is that Spark does not fully support Hadoop 3.x because 
https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for 
Hadoop 3.x) has not been resolved. There are also likely to be transitive 
dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM akshay naidu wrote:
yes, the Spark download page does mention that 2.2.1 is for 'Hadoop 2.7 and
later', but my confusion is because Spark 2.2.1 was released on 1st Dec and the
Hadoop 3 stable version was released on 13th Dec. And to my similar question on
stackoverflow.com, Mr. jacek-laskowski replied that Spark 2.2.1 doesn't support
Hadoop 3, so I am just looking for more clarity on this doubt before moving on
to upgrades.

Thanks all for help.

Akshay.

On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao wrote:
AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it is 
not clear whether it is supported or not (or has some issues). I think in the 
download page "Pre-Built for Apache Hadoop 2.7 and later" mostly means that it 
supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).

Thanks
Jerry

2018-01-08 4:50 GMT+08:00 Raj Adyanthaya:
Hi Akshay

On the Spark Download page when you select Spark 2.2.1 it gives you an option 
to select package type. In that, there is an option to select  "Pre-Built for 
Apache Hadoop 2.7 and later". I am assuming it means that it does support 
Hadoop 3.0.

http://spark.apache.org/downloads.html

Thanks,
Raj A.

On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu wrote:
hello Users,
I need to know whether we can run the latest Spark on the latest Hadoop version,
i.e., Spark 2.2.1 (released on 1st Dec) and Hadoop 3.0.0 (released on 13th Dec).
thanks.





select with more than 5 typed columns

2018-01-08 Thread Nathan Kronenfeld
Looking in Dataset, there are select functions taking from 1 to 5
TypedColumn arguments.

Is there a built-in way to pull out more than 5 typed columns into a
Dataset (without having to resort to using a DataFrame, or manual
processing of the RDD)?

Thanks,
 - Nathan Kronenfeld
 - Uncharted Software
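
A possible workaround, not a built-in typed select (a minimal sketch assuming Spark 2.x
tuple encoders; the case class and column names are hypothetical): do the wide select with
untyped columns and re-attach the type with .as[...], which admittedly passes through a
DataFrame for one step.

// paste into spark-shell (Spark 2.x), where `spark` and its implicits are available
import spark.implicits._

// Hypothetical case class with more than 5 columns
case class Wide(a: Int, b: Int, c: Int, d: Int, e: Int, f: Int)

val ds = Seq.tabulate(10)(i => Wide(i, i + 1, i + 2, i + 3, i + 4, i + 5)).toDS()

// The typed select overloads stop at 5 TypedColumns; selecting more columns
// untyped and then calling .as[...] recovers a Dataset of the wider tuple.
val sixCols = ds
  .select($"a", $"b", $"c", $"d", $"e", $"f")
  .as[(Int, Int, Int, Int, Int, Int)]

sixCols.show()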


Re: Spark Monitoring using Jolokia

2018-01-08 Thread Thakrar, Jayesh
And here's some more info on Spark Metrics

https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling


From: Maximiliano Felice 
Date: Monday, January 8, 2018 at 8:14 AM
To: Irtiza Ali 
Cc: 
Subject: Re: Spark Monitoring using Jolokia

Hi!

I don't know very much about them, but I'm currently working on posting custom
metrics into Graphite. I found the internals described in this library useful:
https://github.com/groupon/spark-metrics

Hope this at least can give you a hint.

Best of luck!

2018-01-08 10:55 GMT-03:00 Irtiza Ali:
Hello everyone,

I am building a monitoring tool for Spark, and for that I need Spark's metrics.
I am using Jolokia to get the metrics.

I have a couple of questions:

Can I get all the metrics provided by the Spark REST API using Jolokia?

How does the Spark REST API get the metrics internally?


Thanks



Spark MakeRDD preferred workers

2018-01-08 Thread Christopher Piggott
Hi,

def makeRDD[T](seq: Seq[(T, Seq[String])])(implicit arg0: ClassTag[T]): RDD[T]

where seq is a list of tuples of data and location preferences (hostnames of
Spark nodes).


Is that list a set of acceptable choices, from which it will pick one? Or is it
an ordered list?  I'm trying to ascertain how well it will distribute if there's
a lot of overlap between partitions and nodes.

In my particular case, my RDD is a Seq of (filePath, hosts[]) where hosts are
nodes on which the file's blocks are local.

--C
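
Not an answer to the ordering question, but a small sketch (file paths and hostnames are
hypothetical placeholders) that shows the call shape and lets you inspect what Spark records
per partition via RDD.preferredLocations:

// paste into spark-shell; sc is the predefined SparkContext
val data: Seq[(String, Seq[String])] = Seq(
  ("hdfs:///topdir/dir1/file-a", Seq("worker-1", "worker-2")),
  ("hdfs:///topdir/dir1/file-b", Seq("worker-2", "worker-3")),
  ("hdfs:///topdir/dir2/file-c", Seq("worker-3"))
)

// The explicit type parameter selects the (T, Seq[String]) overload; each element
// becomes one partition, with the Seq[String] recorded as that partition's
// preferred locations (a locality hint for the scheduler, not a guarantee).
val rdd = sc.makeRDD[String](data)

// Inspect what was recorded for each partition
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: preferred = ${rdd.preferredLocations(p)}")
}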


Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Josh Rosen
My current best guess is that Spark does *not* fully support Hadoop 3.x
because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive
shims for Hadoop 3.x) has not been resolved. There are also likely to be
transitive dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM akshay naidu  wrote:

> yes, the Spark download page does mention that 2.2.1 is for 'Hadoop 2.7 and
> later', but my confusion is because Spark 2.2.1 was released on 1st Dec and
> the Hadoop 3 stable version was released on 13th Dec. And to my similar
> question on stackoverflow.com, Mr. jacek-laskowski replied that Spark 2.2.1
> doesn't support Hadoop 3, so I am just looking for more clarity on this doubt
> before moving on to upgrades.
>
> Thanks all for help.
>
> Akshay.
>
> On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao wrote:
>
>> AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it
>> is not clear whether it is supported or not (or has some issues). I think
>> in the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly
>> means that it supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).
>>
>> Thanks
>> Jerry
>>
>> 2018-01-08 4:50 GMT+08:00 Raj Adyanthaya :
>>
>>> Hi Akshay
>>>
>>> On the Spark Download page when you select Spark 2.2.1 it gives you an
>>> option to select package type. In that, there is an option to select
>>> "Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it
>>> does support Hadoop 3.0.
>>>
>>> http://spark.apache.org/downloads.html
>>>
>>> Thanks,
>>> Raj A.
>>>
>>> On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu wrote:
>>>
 hello Users,
 I need to know whether we can run the latest Spark on the latest Hadoop
 version, i.e., Spark 2.2.1 (released on 1st Dec) and Hadoop 3.0.0 (released
 on 13th Dec).
 thanks.

>>>
>>>
>>
>


Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread akshay naidu
yes, the Spark download page does mention that 2.2.1 is for 'Hadoop 2.7 and
later', but my confusion is because Spark 2.2.1 was released on 1st Dec and the
Hadoop 3 stable version was released on 13th Dec. And to my similar question on
stackoverflow.com, Mr. jacek-laskowski replied that Spark 2.2.1 doesn't support
Hadoop 3, so I am just looking for more clarity on this doubt before moving on
to upgrades.

Thanks all for help.
Akshay.

On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao  wrote:

> AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it
> is not clear whether it is supported or not (or has some issues). I think
> in the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly
> means that it supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).
>
> Thanks
> Jerry
>
> 2018-01-08 4:50 GMT+08:00 Raj Adyanthaya :
>
>> Hi Akshay
>>
>> On the Spark Download page when you select Spark 2.2.1 it gives you an
>> option to select package type. In that, there is an option to select
>> "Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it
>> does support Hadoop 3.0.
>>
>> http://spark.apache.org/downloads.html
>>
>> Thanks,
>> Raj A.
>>
>> On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu wrote:
>>
>>> hello Users,
>>> I need to know whether we can run the latest Spark on the latest Hadoop
>>> version, i.e., Spark 2.2.1 (released on 1st Dec) and Hadoop 3.0.0 (released
>>> on 13th Dec).
>>> thanks.
>>>
>>
>>
>


PIG to Spark

2018-01-08 Thread Pralabh Kumar
Hi

Is there a convenient way / open source project to convert PIG scripts to
Spark?


Regards
Pralabh Kumar


Spark structured streaming time series forecasting

2018-01-08 Thread Bogdan Cojocar
Hello,

Is there a method to do time series forecasting in Spark Structured
Streaming? Is there any integration going on with spark-ts or a similar
library?

Many thanks,
Bogdan Cojocar


binaryFiles() on directory full of directories

2018-01-08 Thread Christopher Piggott
I have a top-level directory in HDFS that contains nothing but
subdirectories (no actual files). In each of those subdirs is a
combination of files and other subdirs:

/topdir/dir1/(lots of files)
/topdir/dir2/(lots of files)
/topdir/dir2/subdir/(lots of files)

I noticed something strange:

spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*", 32*8)
  .filter { case (fileName, contents) => fileName.endsWith(".xyz") }
  .map { case (fileName, contents) => 1 }
  .reduce(_+_)

fails with an ArrayIndexOutOfBoundsException ... but if I specify it as:

  spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*/*", 32*8)

it works.

I played around a little and found I could get the first attempt to work if
I just put one regular file in /topdir

This is with Spark 2.2.1

Is this known behavior?

--C
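
Not an answer to why the single glob fails, but a possible workaround sketch (it assumes
the HDFS URI and .xyz extension from the snippet above, and that binaryFiles accepts a
comma-separated list of input paths): enumerate the files on the driver and pass them
explicitly, so no glob has to expand over nested directories.

// paste into spark-shell; spark is the predefined SparkSession
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.fs.{FileSystem, Path}

val rootUri = "hdfs://10.240.2.200/topdir"
val fs = FileSystem.get(new java.net.URI(rootUri),
  spark.sparkContext.hadoopConfiguration)

// Recursively list the actual files under /topdir on the driver, keeping
// only the ones with the desired extension.
val matching = ArrayBuffer[String]()
val it = fs.listFiles(new Path(rootUri), true /* recursive */)
while (it.hasNext) {
  val status = it.next()
  if (status.getPath.getName.endsWith(".xyz")) matching += status.getPath.toString
}

// Hand binaryFiles the explicit comma-separated path list;
// count() replaces the map/reduce from the original snippet.
val count = spark.sparkContext.binaryFiles(matching.mkString(","), 32 * 8).count()
println(count)

This assumes the number of matching files is modest enough to join into a single path
string on the driver.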


Re: Spark Monitoring using Jolokia

2018-01-08 Thread Maximiliano Felice
Hi!

I don't know very much about them, but I'm currently working on posting
custom metrics into Graphite. I found the internals described in this
library useful: https://github.com/groupon/spark-metrics


Hope this at least can give you a hint.

Best of luck!

2018-01-08 10:55 GMT-03:00 Irtiza Ali :

> Hello everyone,
>
> I am building a monitoring tool for Spark, and for that I need Spark's
> metrics. I am using Jolokia to get the metrics.
>
> I have a couple of questions:
>
> Can I get all the metrics provided by the Spark REST API using Jolokia?
>
> How does the Spark REST API get the metrics internally?
>
>
> Thanks
>


Spark Monitoring using Jolokia

2018-01-08 Thread Irtiza Ali
Hello everyone,

I am building a monitoring tool for Spark, and for that I need Spark's metrics.
I am using Jolokia to get the metrics.

I have a couple of questions:

Can I get all the metrics provided by the Spark REST API using Jolokia?

How does the Spark REST API get the metrics internally?


Thanks


Reverse MinMaxScaler in SparkML

2018-01-08 Thread Tomasz Dudek
Hello,

since the similar question on StackOverflow remains unanswered (
https://stackoverflow.com/questions/46092114/is-there-no-inverse-transform-method-for-a-scaler-like-minmaxscaler-in-spark
) and perhaps there is a solution that I am not aware of, I'll ask:

After training MinMaxScaler (or a similar scaler), is there any built-in way to
revert the process? What I mean is to transform the scaled data back to its
original form. Scikit-learn has a dedicated method, inverse_transform, that does
exactly that.

I can, of course, get the originalMin/originalMax Vectors from the
MinMaxScalerModel and then map the values myself, but it would be nice to
have it built-in.

Yours,
Tomasz
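
For reference, a minimal sketch of that manual mapping (not a built-in API; it assumes
Spark 2.x ML vectors, uses hypothetical column names, and ignores MinMaxScaler's special
case for constant columns):

// paste into spark-shell; spark is the predefined SparkSession
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
).map(Tuple1.apply).toDF("features")

val model = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .fit(df)

val scaled = model.transform(df)

// Forward: y = (x - origMin) / (origMax - origMin) * (max - min) + min
// Inverse: x = (y - min) / (max - min) * (origMax - origMin) + origMin
val origMin = model.originalMin.toArray
val origMax = model.originalMax.toArray
val lo = model.getMin
val hi = model.getMax

// Per-element inverse transform; constant columns (origMax == origMin)
// are not handled in this sketch.
val invert = udf { v: Vector =>
  Vectors.dense(v.toArray.zipWithIndex.map { case (y, i) =>
    (y - lo) / (hi - lo) * (origMax(i) - origMin(i)) + origMin(i)
  })
}

scaled.withColumn("restored", invert($"scaled")).show(false)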