Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-19 Thread Gurusamy Thirupathy
Hi Guha,

Thanks for your quick response. Options a and b are already in our table.
For option b, we hit the same problem again: we don't know which column is a date.
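
One idea I am considering (a rough sketch only, with placeholder names, and
the JDBC-metadata approach is an assumption I have not verified): look up the
Oracle column types at run time and cast the matching DataFrame columns before
writing, for example:

```
import java.sql.{DriverManager, Types}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Cast DataFrame columns to timestamp when the corresponding Oracle column
// is a DATE, discovered at run time from JDBC metadata.
def alignDateColumns(df: DataFrame, jdbcUrl: String, user: String,
                     password: String, oracleTable: String): DataFrame = {
  val conn = DriverManager.getConnection(jdbcUrl, user, password)
  try {
    val cols = conn.getMetaData.getColumns(null, null, oracleTable.toUpperCase, null)
    var dateCols = Set.empty[String]
    while (cols.next()) {
      val t = cols.getInt("DATA_TYPE")
      // Oracle DATE may be reported as DATE or TIMESTAMP depending on the driver
      if (t == Types.DATE || t == Types.TIMESTAMP)
        dateCols += cols.getString("COLUMN_NAME").toLowerCase
    }
    dateCols.foldLeft(df) { (acc, c) =>
      if (acc.columns.map(_.toLowerCase).contains(c))
        acc.withColumn(c, col(c).cast("timestamp"))
      else acc
    }
  } finally conn.close()
}
```

The aligned DataFrame could then be written with df.write.jdbc(...) as usual;
to_date/to_timestamp could replace the plain cast if the varchar format needs
to be parsed explicitly.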


Thanks,
-G

On Sun, Mar 18, 2018 at 9:36 PM, Deepak Sharma 
wrote:

> The other approach would be to write to a temp table and then merge the data.
> But this may be an expensive solution.
>
> Thanks
> Deepak
>
> On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy 
> wrote:
>
>> Hi,
>>
>> I am trying to read data from Hive as a DataFrame and then write the DF
>> into an Oracle database. In this case, the date field/column in Hive has
>> type Varchar(20), but the corresponding column type in Oracle is Date.
>> While reading from Hive, the Hive table names are decided dynamically
>> (read from another table) based on some job condition (e.g. Job1). There
>> are multiple tables like this, so the column and table names are known
>> only at run time, and I can't do the type conversion explicitly when
>> reading from Hive.
>>
>> Is there any utility/API available in Spark to handle this conversion?
>>
>>
>> Thanks,
>> Guru
>>
>


-- 
Thanks,
Guru


Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread kant kodali
Yes, it indeed makes sense! Is there a way to get incremental counts while I
start from 0 and go through the 10M records? Perhaps a count for every micro
batch or something like that?
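
Something like the sketch below is roughly what I am after (placeholders only,
and it reports input rows per micro batch rather than the running aggregate
itself). Would attaching a StreamingQueryListener be the right way to get it?

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.getOrCreate()

// Log how many rows each micro batch pulled in, plus a running total.
spark.streams.addListener(new StreamingQueryListener {
  private var total = 0L
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    total += event.progress.numInputRows
    println(s"batch ${event.progress.batchId}: " +
      s"${event.progress.numInputRows} new rows, $total so far")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```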

On Mon, Mar 19, 2018 at 1:57 PM, Geoff Von Allmen 
wrote:

> Trigger does not mean report the current result every 'trigger seconds'.
> It means it will attempt to fetch new data and process it no faster than
> the trigger interval.
>
> If you're reading from the beginning and you've got 10M entries in kafka,
> it's likely pulling everything down then processing it completely and
> giving you an initial output. From here on out, it will check kafka every 1
> second for new data and process it, showing you only the updated rows. So
> the initial read will give you the entire output since there is nothing to
> be 'updating' from. If you add data to kafka now that the streaming job has
> completed its first batch (and leave it running), it will then show you
> the new/updated rows since the last batch every 1 second (assuming it can
> fetch + process in that time span).
>
> If the combined fetch + processing time is > the trigger time, you will
> notice warnings that it is 'falling behind' (I forget the exact verbiage,
> but something to the effect of the calculation took XX time and is falling
> behind). In that case, it will immediately check kafka for new messages and
> begin processing the next batch (if new messages exist).
>
> Hope that makes sense -
>
>
> On Mon, Mar 19, 2018 at 13:36 kant kodali  wrote:
>
>> Hi All,
>>
>> I have 10 million records in my Kafka topic and I am just trying to run
>> spark.sql("select count(*) from kafka_view"). I am reading from Kafka and
>> writing to Kafka.
>>
>> My writeStream is set to "update" mode with a trigger interval of one
>> second (Trigger.ProcessingTime(1000)). I expect the counts to be printed
>> every second, but it looks like it only prints after going through all 10M.
>> Why?
>>
>> Also, it seems to take forever, whereas a Linux wc over 10M rows would take
>> 30 seconds.
>>
>> Thanks!
>>
>


Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread Geoff Von Allmen
Trigger does not mean report the current result every 'trigger seconds'.
It means it will attempt to fetch new data and process it no faster than
the trigger interval.

If you're reading from the beginning and you've got 10M entries in kafka,
it's likely pulling everything down then processing it completely and
giving you an initial output. From here on out, it will check kafka every 1
second for new data and process it, showing you only the updated rows. So
the initial read will give you the entire output since there is nothing to
be 'updating' from. If you add data to kafka now that the streaming job has
completed its first batch (and leave it running), it will then show you
the new/updated rows since the last batch every 1 second (assuming it can
fetch + process in that time span).

If the combined fetch + processing time is > the trigger time, you will
notice warnings that it is 'falling behind' (I forget the exact verbiage,
but something to the effect of the calculation took XX time and is falling
behind). In that case, it will immediately check kafka for new messages and
begin processing the next batch (if new messages exist).

Hope that makes sense -
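
For reference, a minimal sketch of the kind of setup being discussed (topic
and server names are placeholders, and the console sink stands in for whatever
sink is actually in use): a streaming count in update mode with a 1-second
processing-time trigger.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.getOrCreate()

// Global count over a Kafka source, emitted in update mode every ~1 second.
val counts = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "my_topic")                     // placeholder
  .option("startingOffsets", "earliest")
  .load()
  .groupBy()
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()
  .awaitTermination()
```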


On Mon, Mar 19, 2018 at 13:36 kant kodali  wrote:

> Hi All,
>
> I have 10 million records in my Kafka topic and I am just trying to run
> spark.sql("select count(*) from kafka_view"). I am reading from Kafka and
> writing to Kafka.
>
> My writeStream is set to "update" mode with a trigger interval of one second
> (Trigger.ProcessingTime(1000)). I expect the counts to be printed every
> second, but it looks like it only prints after going through all 10M. Why?
>
> Also, it seems to take forever, whereas a Linux wc over 10M rows would take
> 30 seconds.
>
> Thanks!
>


select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread kant kodali
Hi All,

I have 10 million records in my Kafka topic and I am just trying to run
spark.sql("select count(*) from kafka_view"). I am reading from Kafka and
writing to Kafka.

My writeStream is set to "update" mode with a trigger interval of one second
(Trigger.ProcessingTime(1000)). I expect the counts to be printed every
second, but it looks like it only prints after going through all 10M. Why?

Also, it seems to take forever, whereas a Linux wc over 10M rows would take
30 seconds.

Thanks!


Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi Keith,

Thank you for the idea!
I have tried it, and now the executor command looks like this:

/bin/bash -c /usr/java/latest//bin/java -server -Xmx51200m
'-Djava.io.tmpdir=my_prefered_path'
-Djava.io.tmpdir=/tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_1521110306769_0041/container_1521110306769_0041_01_04/tmp

The JVM is using the second -Djava.io.tmpdir parameter and writing
everything to the same directory as before.

Best,
Michael


On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman  wrote:
> Can you try setting spark.executor.extraJavaOptions to have
> -Djava.io.tmpdir=someValue
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma 
> wrote:
>>
>> Hi Keith,
>>
>> Thank you for your answer!
>> I have done this, and it is working for spark driver.
>> I would like to make something like this for the executors as well, so
>> that the setting will be used on all the nodes, where I have executors
>> running.
>>
>> Best,
>> Michael
>>
>>
>> On Mon, Mar 19, 2018 at 6:07 PM, Keith Chapman 
>> wrote:
>> > Hi Michael,
>> >
>> > You could either set spark.local.dir through spark conf or
>> > java.io.tmpdir
>> > system property.
>> >
>> > Regards,
>> > Keith.
>> >
>> > http://keith-chapman.com
>> >
>> > On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma 
>> > wrote:
>> >>
>> >> Hi everybody,
>> >>
>> >> I am running spark job on yarn, and my problem is that the blockmgr-*
>> >> folders are being created under
>> >> /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/*
>> >> The size of this folder can grow to a significant size and does not
>> >> really fit into /tmp file system for one job, which makes a real
>> >> problem for my installation.
>> >> I have redefined hadoop.tmp.dir in core-site.xml and
>> >> yarn.nodemanager.local-dirs in yarn-site.xml pointing to other
>> >> location and expected that the block manager will create the files
>> >> there and not under /tmp, but this is not the case. The files are
>> >> created under /tmp.
>> >>
>> >> I am wondering if there is a way to make spark not use /tmp at all and
>> >> configure it to create all the files somewhere else ?
>> >>
>> >> Any assistance would be greatly appreciated!
>> >>
>> >> Best,
>> >> Michael
>> >>
>> >>
>> >
>
>




Re: Structured Streaming: distinct (Spark 2.2)

2018-03-19 Thread Burak Yavuz
I believe the docs are out of date regarding distinct. The behavior should
be as follows:

 - Distinct should be applied across triggers
 - In order to prevent the state from growing indefinitely, you need to add
a watermark
 - If you don't have a watermark, but your key space is small, that's also
fine
 - If a record arrives and is not in the state, it will be outputted
 - If a record arrives and is in the state, it will be ignored
 - Once the watermark passes for a key, it will be dropped from state
 - If a record arrives late, i.e. after the watermark, it will be ignored
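
For illustration, a small sketch of that behavior (column names and Kafka
options are placeholders, not taken from your job): deduplicate on a key plus
the event-time column, with a watermark so old state can be dropped.

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Streaming deduplication: state is kept per (id, eventTime) and cleaned up
// once the event-time watermark passes.
val deduped = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "events")                       // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS id", "timestamp AS eventTime")
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("id", "eventTime")
```

The deduplicated stream can then be written out with writeStream as usual.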

HTH!
Burak


On Mon, Mar 19, 2018 at 12:04 PM, Geoff Von Allmen 
wrote:

> I see in the documentation that the distinct operation is not supported
> 
> in Structured Streaming. That being said, I have noticed that you are able
> to successfully call distinct() on a data frame and it seems to perform
> the desired operation and doesn’t fail with the AnalysisException as
> expected. If I call it with a column name specified, then it will fail with
> AnalysisException.
>
> I am using Structured Streaming to read from a Kafka stream and my
> question (and concern) is that:
>
>- The distinct operation is properly applied across the *current*
>batch as read from Kafka, however, the distinct operation would not
>apply across batches.
>
> I have tried the following:
>
>- Started the streaming job to see my baseline data and left the job
>streaming
>- Created events in kafka that would increment my counts if distinct
>was not performing as expected
>- Results:
>   - Distinct still seems to be working over the entire data set even
>   as I add new data.
>   - As I add new data, I see spark process the data (I’m doing output
>   mode = update) but there are no new results indicating the distinct
>   function is in fact still working across batches as spark pulls in the 
> new
>   data from kafka.
>
> Does anyone know more about the intended behavior of distinct in
> Structured Streaming?
>
> If this is working as intended, does this mean I could have a dataset that
> is growing without bound being held in memory/disk or something to that
> effect (so it has some way to make that distinct operation against previous
> data)?
>


Structured Streaming: distinct (Spark 2.2)

2018-03-19 Thread Geoff Von Allmen
I see in the documentation that the distinct operation is not supported

in Structured Streaming. That being said, I have noticed that you are able
to successfully call distinct() on a data frame and it seems to perform the
desired operation and doesn’t fail with the AnalysisException as expected.
If I call it with a column name specified, then it will fail with
AnalysisException.

I am using Structured Streaming to read from a Kafka stream and my question
(and concern) is that:

   - The distinct operation is properly applied across the *current* batch
   as read from Kafka, however, the distinct operation would not apply
   across batches.

I have tried the following:

   - Started the streaming job to see my baseline data and left the job
   streaming
   - Created events in kafka that would increment my counts if distinct was
   not performing as expected
   - Results:
  - Distinct still seems to be working over the entire data set even as
  I add new data.
  - As I add new data, I see spark process the data (I’m doing output
  mode = update) but there are no new results indicating the distinct
  function is in fact still working across batches as spark pulls
in the new
  data from kafka.

Does anyone know more about the intended behavior of distinct in Structured
Streaming?

If this is working as intended, does this mean I could have a dataset that
is growing without bound being held in memory/disk or something to that
effect (so it has some way to make that distinct operation against previous
data)?


Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Keith Chapman
Can you try setting spark.executor.extraJavaOptions to have
-Djava.io.tmpdir=someValue

Regards,
Keith.

http://keith-chapman.com

On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma 
wrote:

> Hi Keith,
>
> Thank you for your answer!
> I have done this, and it is working for spark driver.
> I would like to make something like this for the executors as well, so
> that the setting will be used on all the nodes, where I have executors
> running.
>
> Best,
> Michael
>
>
> On Mon, Mar 19, 2018 at 6:07 PM, Keith Chapman 
> wrote:
> > Hi Michael,
> >
> > You could either set spark.local.dir through spark conf or java.io.tmpdir
> > system property.
> >
> > Regards,
> > Keith.
> >
> > http://keith-chapman.com
> >
> > On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma 
> wrote:
> >>
> >> Hi everybody,
> >>
> >> I am running spark job on yarn, and my problem is that the blockmgr-*
> >> folders are being created under
> >> /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/*
> >> The size of this folder can grow to a significant size and does not
> >> really fit into /tmp file system for one job, which makes a real
> >> problem for my installation.
> >> I have redefined hadoop.tmp.dir in core-site.xml and
> >> yarn.nodemanager.local-dirs in yarn-site.xml pointing to other
> >> location and expected that the block manager will create the files
> >> there and not under /tmp, but this is not the case. The files are
> >> created under /tmp.
> >>
> >> I am wondering if there is a way to make spark not use /tmp at all and
> >> configure it to create all the files somewhere else ?
> >>
> >> Any assistance would be greatly appreciated!
> >>
> >> Best,
> >> Michael
> >>
> >>
> >
>


Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi Keith,

Thank you for your answer!
I have done this, and it is working for spark driver.
I would like to do something like this for the executors as well, so
that the setting will be used on all the nodes where I have executors
running.

Best,
Michael


On Mon, Mar 19, 2018 at 6:07 PM, Keith Chapman  wrote:
> Hi Michael,
>
> You could either set spark.local.dir through spark conf or java.io.tmpdir
> system property.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma  wrote:
>>
>> Hi everybody,
>>
>> I am running spark job on yarn, and my problem is that the blockmgr-*
>> folders are being created under
>> /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/*
>> The size of this folder can grow to a significant size and does not
>> really fit into /tmp file system for one job, which makes a real
>> problem for my installation.
>> I have redefined hadoop.tmp.dir in core-site.xml and
>> yarn.nodemanager.local-dirs in yarn-site.xml pointing to other
>> location and expected that the block manager will create the files
>> there and not under /tmp, but this is not the case. The files are
>> created under /tmp.
>>
>> I am wondering if there is a way to make spark not use /tmp at all and
>> configure it to create all the files somewhere else ?
>>
>> Any assistance would be greatly appreciated!
>>
>> Best,
>> Michael
>>
>>
>




Re: Accessing a file that was passed via --files to spark submit

2018-03-19 Thread Marcelo Vanzin
From spark-submit -h:

  --files FILES    Comma-separated list of files to be placed in the working
                   directory of each executor. File paths of these files in
                   executors can be accessed via SparkFiles.get(fileName).
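
For example, a sketch with a placeholder file name (in PySpark the equivalent
call is pyspark.SparkFiles.get), assuming the job was submitted with
--files /path/to/myfile.txt:

```
import scala.io.Source
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Inside a task, resolve the local path of the distributed file by its file
// name (no directory) and read it on the executor.
val lineCounts = spark.sparkContext.parallelize(1 to 4).map { i =>
  val path = SparkFiles.get("myfile.txt")
  val lines = Source.fromFile(path).getLines().size
  (i, lines)
}.collect()
```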

On Sun, Mar 18, 2018 at 1:06 AM, Vitaliy Pisarev
 wrote:
> I am submitting a script to spark-submit and passing it a file using the
> --files property. Later on I need to read it in a worker.
>
> I don't understand what API I should use to do that. I figured I'd try just:
>
> with open('myfile'):
>
> but this did not work.
>
> I am able to pass the file using the addFile mechanism but it may not be
> good enough for me.
>
> This may seem like a very simple question but I did not find any
> comprehensive documentation on spark-submit. The docs sure don't cover it.
>
>



-- 
Marcelo




Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Keith Chapman
Hi Michael,

You could either set spark.local.dir through spark conf or java.io.tmpdir
system property.
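
For example (paths below are placeholders; note that on YARN the executors'
scratch directories normally come from yarn.nodemanager.local-dirs, which can
take precedence over spark.local.dir):

```
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// spark.local.dir controls where Spark writes scratch data (blockmgr-*,
// shuffle spill). The extraJavaOptions route is usually passed on
// spark-submit instead, e.g.
//   --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=/data/tmp"
val conf = new SparkConf()
  .set("spark.local.dir", "/data/spark-scratch")

val spark = SparkSession.builder().config(conf).getOrCreate()
```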

Regards,
Keith.

http://keith-chapman.com

On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma  wrote:

> Hi everybody,
>
> I am running a Spark job on YARN, and my problem is that the blockmgr-*
> folders are being created under
> /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/*
> This folder can grow to a significant size and does not really fit into the
> /tmp file system for one job, which is a real problem for my installation.
> I have redefined hadoop.tmp.dir in core-site.xml and
> yarn.nodemanager.local-dirs in yarn-site.xml, pointing to another location,
> and expected that the block manager would create the files there and not
> under /tmp, but this is not the case. The files are still created under /tmp.
>
> I am wondering if there is a way to make Spark not use /tmp at all and
> configure it to create all the files somewhere else?
>
> Any assistance would be greatly appreciated!
>
> Best,
> Michael
>
>
>


Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi everybody,

I am running a Spark job on YARN, and my problem is that the blockmgr-*
folders are being created under
/tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/*
This folder can grow to a significant size and does not really fit into the
/tmp file system for one job, which is a real problem for my installation.
I have redefined hadoop.tmp.dir in core-site.xml and
yarn.nodemanager.local-dirs in yarn-site.xml, pointing to another location,
and expected that the block manager would create the files there and not
under /tmp, but this is not the case. The files are still created under /tmp.

I am wondering if there is a way to make Spark not use /tmp at all and
configure it to create all the files somewhere else?

Any assistance would be greatly appreciated!

Best,
Michael




Re: Calling Pyspark functions in parallel

2018-03-19 Thread Debabrata Ghosh
Thanks, Jules! I appreciate it a lot indeed!




On Mon, Mar 19, 2018 at 7:16 PM, Jules Damji  wrote:

> What’s your PySpark function? Is it a UDF? If so consider using pandas UDF
> introduced in Spark 2.3.
>
> More info here: https://databricks.com/blog/2017/10/30/introducing-
> vectorized-udfs-for-pyspark.html
>
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Mar 18, 2018, at 10:54 PM, Debabrata Ghosh 
> wrote:
>
> Hi,
>  My dataframe has 2000 rows. Processing each row takes
> 3 seconds, so sequentially it takes 2000 * 3 = 6000 seconds,
> which is a very long time.
>
>   Further, I am contemplating running the function in parallel.
> For example, I would like to divide the total rows in my dataframe by 4,
> prepare sets of 500 rows accordingly, and call my pyspark
> function on them in parallel. I wanted to know if there is any library /
> pyspark function which I can leverage to do this execution in parallel.
>
>   I will really appreciate your feedback at your earliest
> convenience. Thanks,
>
> Debu
>
>


Warnings on data insert into Hive Table using PySpark

2018-03-19 Thread Shahab Yunus
Hi there. When I try to insert data into Hive tables using the following
query, I get the warnings below. The data is inserted fine (and the query
works without warnings directly on the Hive CLI as well). What is the reason
for these warnings and how can we get rid of them?

I am using pyspark interpreter.

spark_session.sql("insert into schema_name.table_name
(partition_col='JobA') values ('value1', 'value2', '2018-03-10')")

-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...

Software:
Scala version 2.11.8
(OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Spark 2.0.2
Hadoop 2.7.3-amzn-0


Thanks & Regards,
Shahab


[Spark Structured Streaming, Spark 2.3.0] Calling current_timestamp() function within a streaming dataframe results in dataType error

2018-03-19 Thread Artem Moskvin
Hi all,

There's probably a regression in Spark 2.3.0. Running the code below in
2.2.1 succeeds but in 2.3.0 results in error
`org.apache.spark.sql.streaming.StreamingQueryException: Invalid call to
dataType on unresolved object, tree: 'current_timestamp`.

```
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

val values = spark.
  readStream.
  format("rate").
  load.
  withColumn("current_timestamp", current_timestamp)

values.
  writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime(10.seconds)).
  start().
  awaitTermination()
```

Can anyone confirm the same behavior?


Respectfully,
Artem Moskvin


Re: Calling Pyspark functions in parallel

2018-03-19 Thread Jules Damji
What’s your PySpark function? Is it a UDF? If so consider using pandas UDF 
introduced in Spark 2.3. 

More info here: 
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html


Sent from my iPhone
Pardon the dumb thumb typos :)

> On Mar 18, 2018, at 10:54 PM, Debabrata Ghosh  wrote:
> 
> Hi,
>  My dataframe has 2000 rows. Processing each row takes
> 3 seconds, so sequentially it takes 2000 * 3 = 6000 seconds,
> which is a very long time.
>
>   Further, I am contemplating running the function in parallel.
> For example, I would like to divide the total rows in my dataframe by 4,
> prepare sets of 500 rows accordingly, and call my pyspark
> function on them in parallel. I wanted to know if there is any library /
> pyspark function which I can leverage to do this execution in parallel.
>
>   I will really appreciate your feedback at your earliest
> convenience. Thanks,
> 
> Debu


Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Serega Sheypak
Hi Jörn, thanks for your reply.
Oozie starts the Oozie java action as a single "long running" MapReduce mapper.
This mapper is responsible for calling the main class. The main class belongs
to the user, and it starts the Spark job.
yarn-cluster is not an option for me: I would have to do something special to
manage a "runaway" driver. Imagine I want to kill the Spark job. I can just
kill the Oozie workflow, and it will kill the spawned mapper with the main
class and the driver inside it.
That won't happen in yarn-cluster mode, since the driver is not running in a
process "managed" by Oozie.


2018-03-19 13:41 GMT+01:00 Jörn Franke :

> Maybe you should run it in yarn-cluster mode instead. Yarn-client would
> start the driver on the Oozie server.
>
> On 19. Mar 2018, at 12:58, Serega Sheypak 
> wrote:
>
> I'm trying to run it as Oozie java action and reduce env dependency. The
> only thing I need is Hadoop Configuration to talk to hdfs and yarn.
> Spark submit is a shell thing. Trying to do all from jvm.
> Oozie java action starts main class which inststiates SparkConf and
> session. It works well in local mode but throws exception when I try to run
> spark as yarn-client
>
> пн, 19 марта 2018 г. в 7:16, Jacek Laskowski :
>
>> Hi,
>>
>> What's the deployment process then (if not using spark-submit)? How is
>> the AM deployed? Why would you want to skip spark-submit?
>>
>> Jacek
>>
>> On 19 Mar 2018 00:20, "Serega Sheypak"  wrote:
>>
>>> Hi, is it even possible to run Spark on YARN as a usual Java application?
>>> I've built a jar using Maven with the spark-yarn dependency and I manually
>>> populate SparkConf with all the Hadoop properties.
>>> SparkContext fails to start with exception:
>>>
>>> Caused by: java.lang.IllegalStateException: Library directory
>>> '/hadoop/yarn/local/usercache/root/appcache/application_1521375636129_0022/container_e06_1521375636129_0022_01_02/assembly/target/scala-2.11/jars'
>>> does not exist; make sure Spark is built.
>>> at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:260)
>>> at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:359)
>>> at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
>>>
>>>
>>> I took a look at the code and it has some hardcodes and checks for
>>> specific files layout. I don't follow why :)
>>> Is it possible to bypass such checks?
>>>
>>


Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Jörn Franke
Maybe you should run it in yarn-cluster mode instead. Yarn-client would start
the driver on the Oozie server.

> On 19. Mar 2018, at 12:58, Serega Sheypak  wrote:
> 
> I'm trying to run it as Oozie java action and reduce env dependency. The only 
> thing I need is Hadoop Configuration to talk to hdfs and yarn. 
> Spark submit is a shell thing. Trying to do all from jvm. 
> Oozie java action starts a main class which instantiates SparkConf and a session.
> It works well in local mode but throws exception when I try to run spark as 
> yarn-client
> 
> On Mon, Mar 19, 2018 at 7:16, Jacek Laskowski wrote:
>> Hi,
>> 
>> What's the deployment process then (if not using spark-submit)? How is the 
>> AM deployed? Why would you want to skip spark-submit?
>> 
>> Jacek
>> 
>>> On 19 Mar 2018 00:20, "Serega Sheypak"  wrote:
>>> Hi, is it even possible to run Spark on YARN as a usual Java application?
>>> I've built a jar using Maven with the spark-yarn dependency and I manually
>>> populate SparkConf with all the Hadoop properties.
>>> SparkContext fails to start with exception:
>>> Caused by: java.lang.IllegalStateException: Library directory 
>>> '/hadoop/yarn/local/usercache/root/appcache/application_1521375636129_0022/container_e06_1521375636129_0022_01_02/assembly/target/scala-2.11/jars'
>>>  does not exist; make sure Spark is built.
>>> at 
>>> org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:260)
>>> at 
>>> org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:359)
>>> at 
>>> org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
>>> 
>>> I took a look at the code and it has some hardcodes and checks for specific 
>>> files layout. I don't follow why :)
>>> Is it possible to bypass such checks?


Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Serega Sheypak
I'm trying to run it as an Oozie java action and reduce environment
dependencies. The only thing I need is a Hadoop Configuration to talk to HDFS
and YARN. Spark-submit is a shell thing; I'm trying to do everything from the
JVM. The Oozie java action starts a main class which instantiates SparkConf
and a session. It works well in local mode but throws an exception when I try
to run Spark as yarn-client.
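
Roughly, the setup looks like the sketch below (paths are placeholders; the
spark.yarn.jars line is an assumption I have not verified, added because it is
the documented way to point the YARN client at Spark jars on HDFS instead of a
local Spark build):

```
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setMaster("yarn")
  .setAppName("oozie-launched-app")                        // placeholder
  .set("spark.submit.deployMode", "client")
  // Assumption: pointing the YARN client at Spark jars already on HDFS, so it
  // does not go looking for a local Spark build/assembly directory.
  .set("spark.yarn.jars", "hdfs:///apps/spark/jars/*.jar") // placeholder path
// ...plus the Hadoop properties (fs.defaultFS, resource manager address, ...)
// copied into the conf, as described above.

val spark = SparkSession.builder().config(conf).getOrCreate()
```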

On Mon, Mar 19, 2018 at 7:16, Jacek Laskowski wrote:

> Hi,
>
> What's the deployment process then (if not using spark-submit)? How is the
> AM deployed? Why would you want to skip spark-submit?
>
> Jacek
>
> On 19 Mar 2018 00:20, "Serega Sheypak"  wrote:
>
>> Hi, is it even possible to run Spark on YARN as a usual Java application?
>> I've built a jar using Maven with the spark-yarn dependency and I manually
>> populate SparkConf with all the Hadoop properties.
>> SparkContext fails to start with exception:
>>
>> Caused by: java.lang.IllegalStateException: Library directory
>> '/hadoop/yarn/local/usercache/root/appcache/application_1521375636129_0022/container_e06_1521375636129_0022_01_02/assembly/target/scala-2.11/jars'
>> does not exist; make sure Spark is built.
>> at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:260)
>> at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:359)
>> at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
>>
>>
>> I took a look at the code and it has some hardcodes and checks for
>> specific files layout. I don't follow why :)
>> Is it possible to bypass such checks?
>>
>


Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Jacek Laskowski
Hi,

What's the deployment process then (if not using spark-submit)? How is the
AM deployed? Why would you want to skip spark-submit?

Jacek

On 19 Mar 2018 00:20, "Serega Sheypak"  wrote:

> Hi, is it even possible to run Spark on YARN as a usual Java application?
> I've built a jar using Maven with the spark-yarn dependency and I manually
> populate SparkConf with all the Hadoop properties.
> SparkContext fails to start with exception:
>
> Caused by: java.lang.IllegalStateException: Library directory
> '/hadoop/yarn/local/usercache/root/appcache/application_1521375636129_0022/container_e06_1521375636129_0022_01_02/assembly/target/scala-2.11/jars'
> does not exist; make sure Spark is built.
> at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:260)
> at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:359)
> at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
>
>
> I took a look at the code and it has some hardcodes and checks for
> specific files layout. I don't follow why :)
> Is it possible to bypass such checks?
>