Re: Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas
I was kind of hoping to use Spark in this instance to generate that
intermediate SQL as part of its workflow strategy, as a database-independent
way of doing my preprocessing.
Is there any way to capture the generated SQL from Catalyst? If so, I would
just use JdbcRDD with that.

The other option would be to generate that SQL as text myself, which isn't the
nicest thing to do.
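
A rough way to see what Spark actually plans to push down to the JDBC source is
to inspect the query plans; as far as I know Catalyst does not expose the
complete SQL string it will send, but the physical plan does show the pruned
columns and pushed filters on the JDBC scan (myDF here is a placeholder for a
DataFrame built on a JDBC source):

    myDF.explain(true)                          // prints parsed/analyzed/optimized/physical plans
    println(myDF.queryExecution.optimizedPlan)  // optimized logical plan only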

On Mar 7, 2017 5:02 PM, "Subhash Sriram" <subhash.sri...@gmail.com> wrote:

> Could you create a view of the table on your JDBC data source and just
> query that from Spark?
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> > On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas <elhassan.wa...@gmail.com>
> wrote:
> >
> > As an example, this is basically what I'm doing:
> >
> >   import org.apache.spark.sql.functions.{col, when}
> >   val myDF = originalDataFrame.select(
> >     when(col(columnName) === "foobar", 0)
> >       .when(col(columnName) === "foobarbaz", 1))
> >
> > Except there are many more columns and many more conditionals. The
> > generated Spark workflow starts with a SQL query that basically does:
> >
> >    SELECT columnName, columnName2, etc. FROM table;
> >
> > Then the conditionals/transformations are evaluated on the cluster.
> >
> > Is there a way from the Dataset API to force the computation to happen
> > on the SQL data source in this case? Or should I work with JdbcRDD and use
> > createDataFrame on that?
> >
> >
> >> On 03/07/2017 02:19 PM, Jörn Franke wrote:
> >> Can you provide some source code? I am not sure I understood the
> >> problem. If you want to do preprocessing at the JDBC data source, then
> >> you can write your own data source. Additionally, you may want to modify
> >> the SQL statement to extract the data in the right format and push some
> >> preprocessing to the database.
> >>
> >>> On 7 Mar 2017, at 12:04, El-Hassan Wanas <elhassan.wa...@gmail.com>
> wrote:
> >>>
> >>> Hello,
> >>>
> >>> There is, as usual, a big table lying on some JDBC data source. I am
> >>> doing some data processing on that data from Spark; however, in order to
> >>> speed up my analysis, I use reduced encodings and minimize the general
> >>> size of the data before processing.
> >>>
> >>> Spark has been doing a great job of generating the proper workflows that
> >>> do that preprocessing for me, but it seems to generate those workflows
> >>> for execution on the Spark cluster. The issue with that is that the large
> >>> transfer cost is still incurred.
> >>>
> >>> Is there any way to force Spark to run the preprocessing on the JDBC
> >>> data source and get the prepared output DataFrame instead?
> >>>
> >>> Thanks,
> >>>
> >>> Wanas
> >>>
> >>>


Re: Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas

As an example, this is basically what I'm doing:

  import org.apache.spark.sql.functions.{col, when}
  val myDF = originalDataFrame.select(
    when(col(columnName) === "foobar", 0)
      .when(col(columnName) === "foobarbaz", 1))


Except there are many more columns and many more conditionals. The
generated Spark workflow starts with a SQL query that basically does:


SELECT columnName, columnName2, etc. FROM table;

Then the conditionals/transformations are evaluated on the cluster.

Is there a way from the Dataset API to force the computation to happen
on the SQL data source in this case? Or should I work with JdbcRDD and
use createDataFrame on that?
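
One approach that avoids dropping down to JdbcRDD is to hand the preprocessing
to the JDBC reader as a parenthesized subquery in the dbtable option; Spark
wraps its own SELECT around whatever dbtable contains, so the encoding work
runs inside the database. A minimal sketch, assuming spark is the SparkSession
and with the connection URL, table, and column names as placeholders:

    val preprocessed = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // placeholder connection URL
      .option("dbtable",
        """(SELECT CASE columnName
          |          WHEN 'foobar'    THEN 0
          |          WHEN 'foobarbaz' THEN 1
          |        END AS columnName
          |   FROM myTable) AS preprocessed""".stripMargin)  // placeholder table/columns
      .option("user", "...")
      .option("password", "...")
      .load()

The trade-off is that the preprocessing SQL is written by hand per database
dialect, which is exactly the database dependence the Dataset API was meant to
avoid.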



On 03/07/2017 02:19 PM, Jörn Franke wrote:

Can you provide some source code? I am not sure I understood the problem.
If you want to do preprocessing at the JDBC data source, then you can write
your own data source. Additionally, you may want to modify the SQL statement
to extract the data in the right format and push some preprocessing to the
database.


On 7 Mar 2017, at 12:04, El-Hassan Wanas <elhassan.wa...@gmail.com> wrote:

Hello,

There is, as usual, a big table lying on some JDBC data source. I am doing some
data processing on that data from Spark; however, in order to speed up my
analysis, I use reduced encodings and minimize the general size of the data
before processing.

Spark has been doing a great job of generating the proper workflows that do
that preprocessing for me, but it seems to generate those workflows for
execution on the Spark cluster. The issue with that is that the large transfer
cost is still incurred.

Is there any way to force Spark to run the preprocessing on the JDBC data
source and get the prepared output DataFrame instead?

Thanks,

Wanas





Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas

Hello,

There is, as usual, a big table lying on some JDBC data source. I am doing some
data processing on that data from Spark; however, in order to speed up my
analysis, I use reduced encodings and minimize the general size of the data
before processing.

Spark has been doing a great job of generating the proper workflows that do
that preprocessing for me, but it seems to generate those workflows for
execution on the Spark cluster. The issue with that is that the large transfer
cost is still incurred.

Is there any way to force Spark to run the preprocessing on the JDBC data
source and get the prepared output DataFrame instead?


Thanks,

Wanas





SPARK S3 LZO input; worker stuck

2014-07-13 Thread hassan
Hi

I'm trying to read LZO-compressed files from S3 using Spark. The LZO files
are not indexed. The Spark job starts to read the files just fine, but after a
while it just hangs, with no network throughput, and I have to restart the
worker process to get it back up. Any idea what could be causing this? We were
using uncompressed files before and that worked just fine; we went with
compression to reduce S3 storage.

Any help would be appreciated. 

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-S3-LZO-input-worker-stuck-tp9584.html


Spark S3 LZO input files

2014-07-03 Thread hassan
I'm trying to read input files from S3. The files are compressed using LZO,
i.e. from spark-shell:

sc.textFile("s3n://path/xx.lzo").first returns 'String = �LZO?'

Spark does not decompress the data from the file. I am using Cloudera
Manager 5 with CDH 5.0.2. I've already installed the 'GPLEXTRAS' parcel and
have included 'opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar'
and '/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/' in
SPARK_CLASS_PATH. What am I missing?
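
If sc.textFile isn't picking up the LZO codec, one workaround is to name the
input format explicitly. A minimal sketch, assuming the hadoop-lzo jar from the
GPLEXTRAS parcel really is on both the driver and executor classpaths, and with
the S3 path as a placeholder:

    import com.hadoop.mapreduce.LzoTextInputFormat   // provided by hadoop-lzo
    import org.apache.hadoop.io.{LongWritable, Text}

    val lines = sc.newAPIHadoopFile(
        "s3n://path/xx.lzo",                 // placeholder path
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text]).map(_._2.toString)    // keep only the line text

    lines.first

If this returns readable text while sc.textFile does not, the LZO codecs are
probably just not registered in the Hadoop configuration (io.compression.codecs)
that the executors see.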



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-S3-LZO-input-files-tp8706.html


Re: Setting executor memory when using spark-shell

2014-06-05 Thread hassan
just use -Dspark.executor.memory=<amount>
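
Depending on the Spark version, the same setting can usually also be passed
straight on the spark-shell command line, for example
./bin/spark-shell --executor-memory 4g (the 4g value here is only an example).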



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setting-executor-memory-when-using-spark-shell-tp7082p7103.html