Re: Java 9

2017-02-07 Thread kant kodali
Well and the module system!

On Tue, Feb 7, 2017 at 4:03 AM, Timur Shenkao  wrote:

> If I'm not wrong, they got rid of *sun.misc.Unsafe* in Java 9.
>
> This class is still used by several libraries & frameworks.
>
> http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
>
> On Tue, Feb 7, 2017 at 12:51 PM, Pete Robbins  wrote:
>
>> Yes, I agree but it may be worthwhile starting to look at this. I was
>> just trying a build and it trips over some of the now defunct/inaccessible
>> sun.misc classes.
>>
>> I was just interested in hearing if anyone has already gone through this
>> to save me duplicating effort.
>>
>> Cheers,
>>
>> On Tue, 7 Feb 2017 at 11:46 Sean Owen  wrote:
>>
>>> I don't think anyone's tried it. I think we'd first have to agree to
>>> drop Java 7 support before that could be seriously considered. The 8-9
>>> difference is a bit more of a breaking change.
>>>
>>> On Tue, Feb 7, 2017 at 11:44 AM Pete Robbins 
>>> wrote:
>>>
>>> Is anyone working on support for running Spark on Java 9? Is this in a
>>> roadmap anywhere?
>>>
>>>
>>> Cheers,
>>>
>>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Ignore me; a bit more digging and I was able to find the file sink source


Following that pattern worked a treat!

Thanks again Michael :)

On Tue, Feb 7, 2017 at 8:44 PM, Sam Elamin  wrote:

> Sorry those are methods I wrote so you can ignore them :)
>
> so just adding a path parameter tells spark thats where the update log is?
>
> Do I check for the unique id there and identify which batch was written
> and which weren't
>
> Are there any examples of this out there? there aren't much connectors in
> the wild which I can reimplement is there
> Should I look at how the file sink is set up and follow that pattern?
>
>
> Regards
> Sam
>
> On Tue, Feb 7, 2017 at 8:40 PM, Michael Armbrust 
> wrote:
>
>> The JSON log is only used by the file sink (which it doesn't seem like
>> you are using).  Though, I'm not sure exactly what is going on inside of
>> setupGoogle or how tableReferenceSource is used.
>>
>> Typically you would run df.writeStream.option("path", "/my/path")... and
>> then the transaction log would go into /my/path/_spark_metadata.
>>
>> There is no requirement that a sink use this kind of update log.
>> This is just how we get better transactional semantics than HDFS is
>> providing.  If your sink supports transactions natively you should just use
>> those instead.  We pass a unique ID to the sink method addBatch so that you
>> can make sure you don't commit the same transaction more than once.
>>
>> On Tue, Feb 7, 2017 at 3:29 PM, Sam Elamin 
>> wrote:
>>
>>> Hi Michael
>>>
>>> If thats the case for the below example, where should i be reading these
>>> json log files first? im assuming sometime between df and query?
>>>
>>>
>>> val df = spark
>>> .readStream
>>> .option("tableReferenceSource",tableName)
>>> .load()
>>> setUpGoogle(spark.sqlContext)
>>>
>>> val query = df
>>>   .writeStream
>>>   .option("tableReferenceSink",tableName2)
>>>   .option("checkpointLocation","checkpoint")
>>>   .start()
>>>
>>>
>>> On Tue, Feb 7, 2017 at 7:24 PM, Michael Armbrust >> > wrote:
>>>
 Read the JSON log of files that is in `/your/path/_spark_metadata` and
 only read files that are present in that log (ignore anything else).

 On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin 
 wrote:

> Ah I see ok so probably it's the retry that's causing it
>
> So when you say I'll have to take this into account, how do I best do
> that? My sink will have to know what was that extra file. And i was under
> the impression spark would automagically know this because of the
> checkpoint directory set when you created the writestream
>
> If that's not the case then how would I go about ensuring no
> duplicates?
>
>
> Thanks again for the awesome support!
>
> Regards
> Sam
> On Tue, 7 Feb 2017 at 18:05, Michael Armbrust 
> wrote:
>
>> Sorry, I think I was a little unclear.  There are two things at play
>> here.
>>
>>  - Exactly-once semantics with file output: spark writes out extra
>> metadata on which files are valid to ensure that failures don't cause us 
>> to
>> "double count" any of the input.  Spark 2.0+ detects this info
>> automatically when you use dataframe reader (spark.read...). There may be
>> extra files, but they will be ignored. If you are consuming the output 
>> with
>> another system you'll have to take this into account.
>>  - Retries: right now we always retry the last batch when
>> restarting.  This is safe/correct because of the above, but we could also
>> optimize this away by tracking more information about batch progress.
>>
>> On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin 
>> wrote:
>>
>> Hmm ok I understand that but the job is running for a good few mins
>> before I kill it so there should not be any jobs left because I can see 
>> in
>> the log that its now polling for new changes, the latest offset is the
>> right one
>>
>> After I kill it and relaunch it picks up that same file?
>>
>>
>> Sorry if I misunderstood you
>>
>> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>> It is always possible that there will be extra jobs from failed
>> batches. However, for the file sink, only one set of files will make it
>> into _spark_metadata directory log.  This is how we get atomic commits 
>> even
>> when there are files in more than one directory.  When reading the files
>> with Spark, we'll detect this directory and use it instead of listStatus 
>> to
>> find the list of 

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Sorry, those are methods I wrote, so you can ignore them :)

So just adding a path parameter tells Spark that's where the update log is?

Do I check for the unique ID there and identify which batches were written
and which weren't?

Are there any examples of this out there? There aren't many connectors in
the wild that I can reimplement, are there?
Should I look at how the file sink is set up and follow that pattern?


Regards
Sam

On Tue, Feb 7, 2017 at 8:40 PM, Michael Armbrust 
wrote:

> The JSON log is only used by the file sink (which it doesn't seem like you
> are using).  Though, I'm not sure exactly what is going on inside of
> setupGoogle or how tableReferenceSource is used.
>
> Typically you would run df.writeStream.option("path", "/my/path")... and
> then the transaction log would go into /my/path/_spark_metadata.
>
> There is no requirement that a sink use this kind of update log.  This
> is just how we get better transactional semantics than HDFS is providing.
> If your sink supports transactions natively you should just use those
> instead.  We pass a unique ID to the sink method addBatch so that you can
> make sure you don't commit the same transaction more than once.
>
> On Tue, Feb 7, 2017 at 3:29 PM, Sam Elamin 
> wrote:
>
>> Hi Michael
>>
>> If thats the case for the below example, where should i be reading these
>> json log files first? im assuming sometime between df and query?
>>
>>
>> val df = spark
>> .readStream
>> .option("tableReferenceSource",tableName)
>> .load()
>> setUpGoogle(spark.sqlContext)
>>
>> val query = df
>>   .writeStream
>>   .option("tableReferenceSink",tableName2)
>>   .option("checkpointLocation","checkpoint")
>>   .start()
>>
>>
>> On Tue, Feb 7, 2017 at 7:24 PM, Michael Armbrust 
>> wrote:
>>
>>> Read the JSON log of files that is in `/your/path/_spark_metadata` and
>>> only read files that are present in that log (ignore anything else).
>>>
>>> On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin 
>>> wrote:
>>>
 Ah I see ok so probably it's the retry that's causing it

 So when you say I'll have to take this into account, how do I best do
 that? My sink will have to know what was that extra file. And i was under
 the impression spark would automagically know this because of the
 checkpoint directory set when you created the writestream

 If that's not the case then how would I go about ensuring no duplicates?


 Thanks again for the awesome support!

 Regards
 Sam
 On Tue, 7 Feb 2017 at 18:05, Michael Armbrust 
 wrote:

> Sorry, I think I was a little unclear.  There are two things at play
> here.
>
>  - Exactly-once semantics with file output: spark writes out extra
> metadata on which files are valid to ensure that failures don't cause us 
> to
> "double count" any of the input.  Spark 2.0+ detects this info
> automatically when you use dataframe reader (spark.read...). There may be
> extra files, but they will be ignored. If you are consuming the output 
> with
> another system you'll have to take this into account.
>  - Retries: right now we always retry the last batch when restarting.
> This is safe/correct because of the above, but we could also optimize this
> away by tracking more information about batch progress.
>
> On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin 
> wrote:
>
> Hmm ok I understand that but the job is running for a good few mins
> before I kill it so there should not be any jobs left because I can see in
> the log that its now polling for new changes, the latest offset is the
> right one
>
> After I kill it and relaunch it picks up that same file?
>
>
> Sorry if I misunderstood you
>
> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
> It is always possible that there will be extra jobs from failed
> batches. However, for the file sink, only one set of files will make it
> into _spark_metadata directory log.  This is how we get atomic commits 
> even
> when there are files in more than one directory.  When reading the files
> with Spark, we'll detect this directory and use it instead of listStatus 
> to
> find the list of valid files.
>
> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin 
> wrote:
>
> On another note, when it comes to checkpointing on structured streaming
>
> I noticed if I have  a stream running off s3 and I kill the process.
> The next time the process starts running it duplicates the last record
> inserted. is that normal?
>
>
>
>
> So say I have streaming enabled on one folder "test" which only has
> two files "update1" and "update 2", then I kill the Spark job using Ctrl+C.

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
The JSON log is only used by the file sink (which it doesn't seem like you
are using).  Though, I'm not sure exactly what is going on inside of
setupGoogle or how tableReferenceSource is used.

Typically you would run df.writeStream.option("path", "/my/path")... and
then the transaction log would go into /my/path/_spark_metadata.

There is no requirement that a sink use this kind of update log.  This
is just how we get better transactional semantics than HDFS is providing.
If your sink supports transactions natively you should just use those
instead.  We pass a unique ID to the sink method addBatch so that you can
make sure you don't commit the same transaction more than once.
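To make that addBatch contract concrete, a minimal sketch of an idempotent sink follows
(this is not the actual sink being discussed; the Sink trait lives in an internal Spark
package, and commitToExternalStore / readLastCommittedBatchId are hypothetical helpers
for whatever store the sink targets):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Sketch only: assumes the external store can persist the last committed batch ID
// together with the data, so the check below makes commits effectively idempotent.
class IdempotentSink extends Sink {

  @volatile private var lastCommitted: Long = readLastCommittedBatchId()

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId > lastCommitted) {
      // Hypothetical helper: writes the rows and the batchId in one transaction.
      commitToExternalStore(batchId, data)
      lastCommitted = batchId
    }
    // else: Spark re-ran an already-committed batch after a restart, so skip it.
  }

  private def readLastCommittedBatchId(): Long = -1L // hypothetical bootstrap lookup
  private def commitToExternalStore(batchId: Long, data: DataFrame): Unit = () // hypothetical
}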

On Tue, Feb 7, 2017 at 3:29 PM, Sam Elamin  wrote:

> Hi Michael
>
> If thats the case for the below example, where should i be reading these
> json log files first? im assuming sometime between df and query?
>
>
> val df = spark
> .readStream
> .option("tableReferenceSource",tableName)
> .load()
> setUpGoogle(spark.sqlContext)
>
> val query = df
>   .writeStream
>   .option("tableReferenceSink",tableName2)
>   .option("checkpointLocation","checkpoint")
>   .start()
>
>
> On Tue, Feb 7, 2017 at 7:24 PM, Michael Armbrust 
> wrote:
>
>> Read the JSON log of files that is in `/your/path/_spark_metadata` and
>> only read files that are present in that log (ignore anything else).
>>
>> On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin 
>> wrote:
>>
>>> Ah I see ok so probably it's the retry that's causing it
>>>
>>> So when you say I'll have to take this into account, how do I best do
>>> that? My sink will have to know what was that extra file. And i was under
>>> the impression spark would automagically know this because of the
>>> checkpoint directory set when you created the writestream
>>>
>>> If that's not the case then how would I go about ensuring no duplicates?
>>>
>>>
>>> Thanks again for the awesome support!
>>>
>>> Regards
>>> Sam
>>> On Tue, 7 Feb 2017 at 18:05, Michael Armbrust 
>>> wrote:
>>>
 Sorry, I think I was a little unclear.  There are two things at play
 here.

  - Exactly-once semantics with file output: spark writes out extra
 metadata on which files are valid to ensure that failures don't cause us to
 "double count" any of the input.  Spark 2.0+ detects this info
 automatically when you use dataframe reader (spark.read...). There may be
 extra files, but they will be ignored. If you are consuming the output with
 another system you'll have to take this into account.
  - Retries: right now we always retry the last batch when restarting.
 This is safe/correct because of the above, but we could also optimize this
 away by tracking more information about batch progress.

 On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin 
 wrote:

 Hmm ok I understand that but the job is running for a good few mins
 before I kill it so there should not be any jobs left because I can see in
 the log that its now polling for new changes, the latest offset is the
 right one

 After I kill it and relaunch it picks up that same file?


 Sorry if I misunderstood you

 On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

 It is always possible that there will be extra jobs from failed
 batches. However, for the file sink, only one set of files will make it
 into _spark_metadata directory log.  This is how we get atomic commits even
 when there are files in more than one directory.  When reading the files
 with Spark, we'll detect this directory and use it instead of listStatus to
 find the list of valid files.

 On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin 
 wrote:

 On another note, when it comes to checkpointing on structured streaming

 I noticed if I have  a stream running off s3 and I kill the process.
 The next time the process starts running it duplicates the last record
 inserted. is that normal?




 So say I have streaming enabled on one folder "test" which only has two
 files "update1" and "update 2", then I kill the spark job using Ctrl+C.
 When I rerun the stream it picks up "update 2" again

 Is this normal? isnt ctrl+c a failure?

 I would expect checkpointing to know that update 2 was already processed

 Regards
 Sam

 On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin 
 wrote:

 Thanks Michael!



 On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

 Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497

 We should add this soon.

 On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Hi Michael

If that's the case for the example below, where should I be reading these
JSON log files first? I'm assuming somewhere between df and query?


val df = spark
  .readStream
  .option("tableReferenceSource", tableName)
  .load()
setUpGoogle(spark.sqlContext)

val query = df
  .writeStream
  .option("tableReferenceSink",tableName2)
  .option("checkpointLocation","checkpoint")
  .start()


On Tue, Feb 7, 2017 at 7:24 PM, Michael Armbrust 
wrote:

> Read the JSON log of files that is in `/your/path/_spark_metadata` and
> only read files that are present in that log (ignore anything else).
>
> On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin 
> wrote:
>
>> Ah I see ok so probably it's the retry that's causing it
>>
>> So when you say I'll have to take this into account, how do I best do
>> that? My sink will have to know what was that extra file. And i was under
>> the impression spark would automagically know this because of the
>> checkpoint directory set when you created the writestream
>>
>> If that's not the case then how would I go about ensuring no duplicates?
>>
>>
>> Thanks again for the awesome support!
>>
>> Regards
>> Sam
>> On Tue, 7 Feb 2017 at 18:05, Michael Armbrust 
>> wrote:
>>
>>> Sorry, I think I was a little unclear.  There are two things at play
>>> here.
>>>
>>>  - Exactly-once semantics with file output: spark writes out extra
>>> metadata on which files are valid to ensure that failures don't cause us to
>>> "double count" any of the input.  Spark 2.0+ detects this info
>>> automatically when you use dataframe reader (spark.read...). There may be
>>> extra files, but they will be ignored. If you are consuming the output with
>>> another system you'll have to take this into account.
>>>  - Retries: right now we always retry the last batch when restarting.
>>> This is safe/correct because of the above, but we could also optimize this
>>> away by tracking more information about batch progress.
>>>
>>> On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin 
>>> wrote:
>>>
>>> Hmm ok I understand that but the job is running for a good few mins
>>> before I kill it so there should not be any jobs left because I can see in
>>> the log that its now polling for new changes, the latest offset is the
>>> right one
>>>
>>> After I kill it and relaunch it picks up that same file?
>>>
>>>
>>> Sorry if I misunderstood you
>>>
>>> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust >> > wrote:
>>>
>>> It is always possible that there will be extra jobs from failed batches.
>>> However, for the file sink, only one set of files will make it into
>>> _spark_metadata directory log.  This is how we get atomic commits even when
>>> there are files in more than one directory.  When reading the files with
>>> Spark, we'll detect this directory and use it instead of listStatus to find
>>> the list of valid files.
>>>
>>> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin 
>>> wrote:
>>>
>>> On another note, when it comes to checkpointing on structured streaming
>>>
>>> I noticed if I have  a stream running off s3 and I kill the process. The
>>> next time the process starts running it duplicates the last record
>>> inserted. is that normal?
>>>
>>>
>>>
>>>
>>> So say I have streaming enabled on one folder "test" which only has two
>>> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
>>> When I rerun the stream it picks up "update 2" again
>>>
>>> Is this normal? isnt ctrl+c a failure?
>>>
>>> I would expect checkpointing to know that update 2 was already processed
>>>
>>> Regards
>>> Sam
>>>
>>> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin 
>>> wrote:
>>>
>>> Thanks Michael!
>>>
>>>
>>>
>>> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust >> > wrote:
>>>
>>> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>>
>>> We should add this soon.
>>>
>>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 
>>> wrote:
>>>
>>> Hi All
>>>
>>> When trying to read a stream off S3 and I try and drop duplicates I get
>>> the following error:
>>>
>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>> Append output mode not supported when there are streaming aggregations on
>>> streaming DataFrames/DataSets;;
>>>
>>>
>>> Whats strange if I use the batch "spark.read.json", it works
>>>
>>> Can I assume you cant drop duplicates in structured streaming
>>>
>>> Regards
>>> Sam
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Read the JSON log of files that is in `/your/path/_spark_metadata` and only
read files that are present in that log (ignore anything else).
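For a consumer outside Spark, a rough sketch of that filtering is below. The layout under
_spark_metadata (a version header line followed by one JSON entry per line with a "path"
field) is an internal format, so treat the parsing as an assumption to verify against your
Spark version rather than a stable contract:

import java.io.File
import scala.io.Source

// Hypothetical output location written with .option("path", "/my/path").
val metadataDir = new File("/my/path/_spark_metadata")

// Collect the file paths recorded in the sink's log; anything in the output
// directory that is not listed here may be left over from a failed batch.
val committedFiles: Set[String] = metadataDir.listFiles()
  .filter(f => f.isFile && !f.getName.startsWith("."))
  .flatMap(f => Source.fromFile(f).getLines())
  .filter(_.startsWith("{"))                       // skip "v1"-style version header lines
  .flatMap(line => """"path":"([^"]+)"""".r.findFirstMatchIn(line).map(_.group(1)))
  .toSet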

On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin  wrote:

> Ah I see ok so probably it's the retry that's causing it
>
> So when you say I'll have to take this into account, how do I best do
> that? My sink will have to know what was that extra file. And i was under
> the impression spark would automagically know this because of the
> checkpoint directory set when you created the writestream
>
> If that's not the case then how would I go about ensuring no duplicates?
>
>
> Thanks again for the awesome support!
>
> Regards
> Sam
> On Tue, 7 Feb 2017 at 18:05, Michael Armbrust 
> wrote:
>
>> Sorry, I think I was a little unclear.  There are two things at play here.
>>
>>  - Exactly-once semantics with file output: spark writes out extra
>> metadata on which files are valid to ensure that failures don't cause us to
>> "double count" any of the input.  Spark 2.0+ detects this info
>> automatically when you use dataframe reader (spark.read...). There may be
>> extra files, but they will be ignored. If you are consuming the output with
>> another system you'll have to take this into account.
>>  - Retries: right now we always retry the last batch when restarting.
>> This is safe/correct because of the above, but we could also optimize this
>> away by tracking more information about batch progress.
>>
>> On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin 
>> wrote:
>>
>> Hmm ok I understand that but the job is running for a good few mins
>> before I kill it so there should not be any jobs left because I can see in
>> the log that its now polling for new changes, the latest offset is the
>> right one
>>
>> After I kill it and relaunch it picks up that same file?
>>
>>
>> Sorry if I misunderstood you
>>
>> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust 
>> wrote:
>>
>> It is always possible that there will be extra jobs from failed batches.
>> However, for the file sink, only one set of files will make it into
>> _spark_metadata directory log.  This is how we get atomic commits even when
>> there are files in more than one directory.  When reading the files with
>> Spark, we'll detect this directory and use it instead of listStatus to find
>> the list of valid files.
>>
>> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin 
>> wrote:
>>
>> On another note, when it comes to checkpointing on structured streaming
>>
>> I noticed if I have  a stream running off s3 and I kill the process. The
>> next time the process starts running it duplicates the last record
>> inserted. is that normal?
>>
>>
>>
>>
>> So say I have streaming enabled on one folder "test" which only has two
>> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
>> When I rerun the stream it picks up "update 2" again
>>
>> Is this normal? isnt ctrl+c a failure?
>>
>> I would expect checkpointing to know that update 2 was already processed
>>
>> Regards
>> Sam
>>
>> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin 
>> wrote:
>>
>> Thanks Michael!
>>
>>
>>
>> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust 
>> wrote:
>>
>> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>
>> We should add this soon.
>>
>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 
>> wrote:
>>
>> Hi All
>>
>> When trying to read a stream off S3 and I try and drop duplicates I get
>> the following error:
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> Append output mode not supported when there are streaming aggregations on
>> streaming DataFrames/DataSets;;
>>
>>
>> Whats strange if I use the batch "spark.read.json", it works
>>
>> Can I assume you cant drop duplicates in structured streaming
>>
>> Regards
>> Sam
>>
>>
>>
>>
>>
>>
>>
>>


Re: PSA: Java 8 unidoc build

2017-02-07 Thread Shixiong(Ryan) Zhu
@Sean, I'm using Java 8 but don't see these errors until I manually build
the API docs. Hence I think dropping Java 7 support may not help.

Right now we don't build docs in most builds, as building docs takes a
long time (e.g.,
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/2889/ says
it's 1.5 hours).

On Tue, Feb 7, 2017 at 4:06 AM, Sean Owen  wrote:

> I believe that if we ran the Jenkins builds with Java 8 we would catch
> these? this doesn't require dropping Java 7 support or anything.
>
> @joshrosen I know we are just now talking about modifying the Jenkins jobs
> to remove old Hadoop configs. Is it possible to change the master jobs to
> use Java 8? can't hurt really in any event.
>
> Or maybe I'm mistaken and they already run Java 8 and it doesn't catch
> this until Java 8 is the target.
>
> Yeah this is going to keep breaking as javadoc 8 is pretty strict. Thanks
> Hyukjin. It has forced us to clean up a lot of sloppy bits of doc though.
>
>
> On Tue, Feb 7, 2017 at 12:13 AM Joseph Bradley 
> wrote:
>
>> Public service announcement: Our doc build has worked with Java 8 for
>> brief time periods, but new changes keep breaking the Java 8 unidoc build.
>> Please be aware of this, and try to test doc changes with Java 8!  In
>> general, it is stricter than Java 7 for docs.
>>
>> A shout out to @HyukjinKwon and others who have made many fixes for
>> this!  See these sample PRs for some issues causing failures (especially
>> around links):
>> https://github.com/apache/spark/pull/16741
>> https://github.com/apache/spark/pull/16604
>>
>> Thanks,
>> Joseph
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] 
>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Ah I see, OK, so it's probably the retry that's causing it.

So when you say I'll have to take this into account, how do I best do that?
My sink will have to know which file was the extra one. And I was under the
impression Spark would automagically know this because of the checkpoint
directory set when you created the writeStream.

If that's not the case, then how would I go about ensuring no duplicates?


Thanks again for the awesome support!

Regards
Sam
On Tue, 7 Feb 2017 at 18:05, Michael Armbrust 
wrote:

> Sorry, I think I was a little unclear.  There are two things at play here.
>
>  - Exactly-once semantics with file output: spark writes out extra
> metadata on which files are valid to ensure that failures don't cause us to
> "double count" any of the input.  Spark 2.0+ detects this info
> automatically when you use dataframe reader (spark.read...). There may be
> extra files, but they will be ignored. If you are consuming the output with
> another system you'll have to take this into account.
>  - Retries: right now we always retry the last batch when restarting.
> This is safe/correct because of the above, but we could also optimize this
> away by tracking more information about batch progress.
>
> On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin 
> wrote:
>
> Hmm ok I understand that but the job is running for a good few mins before
> I kill it so there should not be any jobs left because I can see in the log
> that its now polling for new changes, the latest offset is the right one
>
> After I kill it and relaunch it picks up that same file?
>
>
> Sorry if I misunderstood you
>
> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust 
> wrote:
>
> It is always possible that there will be extra jobs from failed batches.
> However, for the file sink, only one set of files will make it into
> _spark_metadata directory log.  This is how we get atomic commits even when
> there are files in more than one directory.  When reading the files with
> Spark, we'll detect this directory and use it instead of listStatus to find
> the list of valid files.
>
> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin 
> wrote:
>
> On another note, when it comes to checkpointing on structured streaming
>
> I noticed if I have  a stream running off s3 and I kill the process. The
> next time the process starts running it duplicates the last record
> inserted. is that normal?
>
>
>
>
> So say I have streaming enabled on one folder "test" which only has two
> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
> When I rerun the stream it picks up "update 2" again
>
> Is this normal? isnt ctrl+c a failure?
>
> I would expect checkpointing to know that update 2 was already processed
>
> Regards
> Sam
>
> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin 
> wrote:
>
> Thanks Michael!
>
>
>
> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust 
> wrote:
>
> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>
> We should add this soon.
>
> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 
> wrote:
>
> Hi All
>
> When trying to read a stream off S3 and I try and drop duplicates I get
> the following error:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Append
> output mode not supported when there are streaming aggregations on
> streaming DataFrames/DataSets;;
>
>
> Whats strange if I use the batch "spark.read.json", it works
>
> Can I assume you cant drop duplicates in structured streaming
>
> Regards
> Sam
>
>
>
>
>
>
>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Sorry, I think I was a little unclear.  There are two things at play here.

 - Exactly-once semantics with file output: Spark writes out extra metadata
on which files are valid to ensure that failures don't cause us to "double
count" any of the input.  Spark 2.0+ detects this info automatically when
you use the DataFrame reader (spark.read...). There may be extra files, but
they will be ignored. If you are consuming the output with another system
you'll have to take this into account.
 - Retries: right now we always retry the last batch when restarting.  This
is safe/correct because of the above, but we could also optimize this away
by tracking more information about batch progress.
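As a small sketch of the two consumption paths described above (the output path is an
assumption; substitute your own sink directory):

// 1) Reading the output back with Spark: the DataFrame reader notices the
//    _spark_metadata log and silently skips files left behind by failed batches.
val clean = spark.read.parquet("/my/path")

// 2) Reading the same directory with anything else (Hive, Presto, plain HDFS
//    clients) surfaces every file, leftovers included, so that consumer has to
//    filter against the metadata log itself.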

On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin  wrote:

> Hmm ok I understand that but the job is running for a good few mins before
> I kill it so there should not be any jobs left because I can see in the log
> that its now polling for new changes, the latest offset is the right one
>
> After I kill it and relaunch it picks up that same file?
>
>
> Sorry if I misunderstood you
>
> On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust 
> wrote:
>
>> It is always possible that there will be extra jobs from failed batches.
>> However, for the file sink, only one set of files will make it into
>> _spark_metadata directory log.  This is how we get atomic commits even when
>> there are files in more than one directory.  When reading the files with
>> Spark, we'll detect this directory and use it instead of listStatus to find
>> the list of valid files.
>>
>> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin 
>> wrote:
>>
>>> On another note, when it comes to checkpointing on structured streaming
>>>
>>> I noticed if I have  a stream running off s3 and I kill the process. The
>>> next time the process starts running it duplicates the last record
>>> inserted. is that normal?
>>>
>>>
>>>
>>>
>>> So say I have streaming enabled on one folder "test" which only has two
>>> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
>>> When I rerun the stream it picks up "update 2" again
>>>
>>> Is this normal? isnt ctrl+c a failure?
>>>
>>> I would expect checkpointing to know that update 2 was already processed
>>>
>>> Regards
>>> Sam
>>>
>>> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin 
>>> wrote:
>>>
 Thanks Michael!



 On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>
> We should add this soon.
>
> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 
> wrote:
>
>> Hi All
>>
>> When trying to read a stream off S3 and I try and drop duplicates I
>> get the following error:
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> Append output mode not supported when there are streaming aggregations on
>> streaming DataFrames/DataSets;;
>>
>>
>> Whats strange if I use the batch "spark.read.json", it works
>>
>> Can I assume you cant drop duplicates in structured streaming
>>
>> Regards
>> Sam
>>
>
>

>>>
>>
>


Re: PSA: Java 8 unidoc build

2017-02-07 Thread Felix Cheung
+1 for all the great work going in for this, HyukjinKwon, and +1 on what Sean
says about "Jenkins builds with Java 8", so we can catch these nasty
javadoc 8 issues quickly.

I think that would be a great first step to move away from Java 7.


_________________________________
From: Reynold Xin
Sent: Tuesday, February 7, 2017 4:48 AM
Subject: Re: PSA: Java 8 unidoc build
To: Sean Owen
Cc: Josh Rosen, Joseph Bradley


I don't know if this would help but I think we can also officially stop 
supporting Java 7 ...


On Tue, Feb 7, 2017 at 1:06 PM, Sean Owen wrote:
I believe that if we ran the Jenkins builds with Java 8 we would catch these? 
this doesn't require dropping Java 7 support or anything.

@joshrosen I know we are just now talking about modifying the Jenkins jobs to 
remove old Hadoop configs. Is it possible to change the master jobs to use Java 
8? can't hurt really in any event.

Or maybe I'm mistaken and they already run Java 8 and it doesn't catch this 
until Java 8 is the target.

Yeah this is going to keep breaking as javadoc 8 is pretty strict. Thanks 
Hyukjin. It has forced us to clean up a lot of sloppy bits of doc though.


On Tue, Feb 7, 2017 at 12:13 AM Joseph Bradley wrote:
Public service announcement: Our doc build has worked with Java 8 for brief 
time periods, but new changes keep breaking the Java 8 unidoc build.  Please be 
aware of this, and try to test doc changes with Java 8!  In general, it is 
stricter than Java 7 for docs.

A shout out to @HyukjinKwon and others who have made many fixes for this!  See 
these sample PRs for some issues causing failures (especially around links):
https://github.com/apache/spark/pull/16741
https://github.com/apache/spark/pull/16604

Thanks,
Joseph

--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]





Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Hmm, OK, I understand that, but the job is running for a good few mins before
I kill it, so there should not be any jobs left: I can see in the log
that it's now polling for new changes, and the latest offset is the right one.

After I kill it and relaunch, it picks up that same file?


Sorry if I misunderstood you

On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust 
wrote:

> It is always possible that there will be extra jobs from failed batches.
> However, for the file sink, only one set of files will make it into
> _spark_metadata directory log.  This is how we get atomic commits even when
> there are files in more than one directory.  When reading the files with
> Spark, we'll detect this directory and use it instead of listStatus to find
> the list of valid files.
>
> On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin 
> wrote:
>
>> On another note, when it comes to checkpointing on structured streaming
>>
>> I noticed if I have  a stream running off s3 and I kill the process. The
>> next time the process starts running it duplicates the last record
>> inserted. is that normal?
>>
>>
>>
>>
>> So say I have streaming enabled on one folder "test" which only has two
>> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
>> When I rerun the stream it picks up "update 2" again
>>
>> Is this normal? isnt ctrl+c a failure?
>>
>> I would expect checkpointing to know that update 2 was already processed
>>
>> Regards
>> Sam
>>
>> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin 
>> wrote:
>>
>>> Thanks Michael!
>>>
>>>
>>>
>>> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust >> > wrote:
>>>
 Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497

 We should add this soon.

 On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 
 wrote:

> Hi All
>
> When trying to read a stream off S3 and I try and drop duplicates I
> get the following error:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> Append output mode not supported when there are streaming aggregations on
> streaming DataFrames/DataSets;;
>
>
> Whats strange if I use the batch "spark.read.json", it works
>
> Can I assume you cant drop duplicates in structured streaming
>
> Regards
> Sam
>


>>>
>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
It is always possible that there will be extra jobs from failed batches.
However, for the file sink, only one set of files will make it into the
_spark_metadata directory log.  This is how we get atomic commits even when
there are files in more than one directory.  When reading the files with
Spark, we'll detect this directory and use it instead of listStatus to find
the list of valid files.
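To illustrate the "instead of listStatus" point, the sketch below contrasts a raw
directory listing with the committed set; the path is an assumption and committedFiles
is a placeholder for the set parsed out of _spark_metadata:

import org.apache.hadoop.fs.{FileSystem, Path}

val outputDir = new Path("/my/path") // hypothetical sink output path
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// What a naive consumer sees: every data file on disk, including files written
// by batches that never committed.
val onDisk: Set[String] = fs.listStatus(outputDir)
  .filter(_.isFile)
  .map(_.getPath.getName)
  .toSet

// Placeholder: in practice, build this from the entries under _spark_metadata.
val committedFiles: Set[String] = Set.empty

// Files a failed batch may have left behind; Spark's own reader never surfaces
// them because it trusts the metadata log rather than listStatus.
val orphans = onDisk.filterNot(name => committedFiles.exists(_.endsWith(name)))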

On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin  wrote:

> On another note, when it comes to checkpointing on structured streaming
>
> I noticed if I have  a stream running off s3 and I kill the process. The
> next time the process starts running it duplicates the last record
> inserted. is that normal?
>
>
>
>
> So say I have streaming enabled on one folder "test" which only has two
> files "update1" and "update 2", then I kill the spark job using Ctrl+C.
> When I rerun the stream it picks up "update 2" again
>
> Is this normal? isnt ctrl+c a failure?
>
> I would expect checkpointing to know that update 2 was already processed
>
> Regards
> Sam
>
> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin 
> wrote:
>
>> Thanks Michael!
>>
>>
>>
>> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust 
>> wrote:
>>
>>> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>>
>>> We should add this soon.
>>>
>>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 
>>> wrote:
>>>
 Hi All

 When trying to read a stream off S3 and I try and drop duplicates I get
 the following error:

 Exception in thread "main" org.apache.spark.sql.AnalysisException:
 Append output mode not supported when there are streaming aggregations on
 streaming DataFrames/DataSets;;


 Whats strange if I use the batch "spark.read.json", it works

 Can I assume you cant drop duplicates in structured streaming

 Regards
 Sam

>>>
>>>
>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
On another note, when it comes to checkpointing on structured streaming

I noticed that if I have a stream running off S3 and I kill the process, the
next time the process starts running it duplicates the last record
inserted. Is that normal?

So say I have streaming enabled on one folder, "test", which only has two
files, "update1" and "update 2", and then I kill the Spark job using Ctrl+C.
When I rerun the stream it picks up "update 2" again.

Is this normal? Isn't Ctrl+C a failure?

I would expect checkpointing to know that "update 2" was already processed.

Regards
Sam

On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin  wrote:

> Thanks Michael!
>
>
>
> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust 
> wrote:
>
>> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>
>> We should add this soon.
>>
>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 
>> wrote:
>>
>>> Hi All
>>>
>>> When trying to read a stream off S3 and I try and drop duplicates I get
>>> the following error:
>>>
>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>> Append output mode not supported when there are streaming aggregations on
>>> streaming DataFrames/DataSets;;
>>>
>>>
>>> Whats strange if I use the batch "spark.read.json", it works
>>>
>>> Can I assume you cant drop duplicates in structured streaming
>>>
>>> Regards
>>> Sam
>>>
>>
>>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Thanks Michael!



On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust 
wrote:

> Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>
> We should add this soon.
>
> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin 
> wrote:
>
>> Hi All
>>
>> When trying to read a stream off S3 and I try and drop duplicates I get
>> the following error:
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> Append output mode not supported when there are streaming aggregations on
>> streaming DataFrames/DataSets;;
>>
>>
>> Whats strange if I use the batch "spark.read.json", it works
>>
>> Can I assume you cant drop duplicates in structured streaming
>>
>> Regards
>> Sam
>>
>
>


Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497

We should add this soon.

On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin  wrote:

> Hi All
>
> When trying to read a stream off S3 and I try and drop duplicates I get
> the following error:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Append
> output mode not supported when there are streaming aggregations on
> streaming DataFrames/DataSets;;
>
>
> Whats strange if I use the batch "spark.read.json", it works
>
> Can I assume you cant drop duplicates in structured streaming
>
> Regards
> Sam
>


Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Hi All

When I try to read a stream off S3 and drop duplicates, I get the
following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Append
output mode not supported when there are streaming aggregations on
streaming DataFrames/DataSets;;


What's strange is that if I use the batch "spark.read.json", it works.

Can I assume you can't drop duplicates in structured streaming?

Regards
Sam
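For reference, once SPARK-19497 lands, streaming de-duplication is expected to look
roughly like the sketch below, with a watermark to keep the de-duplication state bounded.
The source path, schema, and column names (id, eventTime) are assumptions:

import org.apache.spark.sql.types._

// Hypothetical schema for the JSON updates (streaming JSON sources need an explicit schema).
val updateSchema = new StructType()
  .add("id", StringType)
  .add("eventTime", TimestampType)

// Not valid on Spark 2.1; a forward-looking sketch only.
val deduped = spark.readStream
  .schema(updateSchema)
  .json("s3a://my-bucket/updates")              // hypothetical S3 source path
  .withWatermark("eventTime", "10 minutes")     // bound how long duplicate state is kept
  .dropDuplicates("id", "eventTime")

val query = deduped.writeStream
  .format("parquet")
  .option("path", "/my/path")                   // hypothetical output path
  .option("checkpointLocation", "checkpoint")
  .start()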


Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2017-02-07 Thread Reynold Xin
Bumping this.

Given that we see occasional build breaks with Java 8, we should reconsider
this and do it for 2.2 or 2.3. By the time 2.2 is released, it will have been
almost a year since this thread started.


On Sun, Jul 24, 2016 at 12:59 AM, Mark Hamstra 
wrote:

> Sure, signalling well ahead of time is good, as is getting better
> performance from Java 8; but do either of those interests really require
> dropping Java 7 support sooner rather than later?
>
> Now, to retroactively copy edit myself, when I previously wrote "after
> all or nearly all relevant clusters are actually no longer running on Java
> 6", I meant "...no longer running on Java 7".  We should be at a point now
> where there aren't many Java 6 clusters left, but my sense is that there
> are still quite a number of Java 7 clusters around, and that there will be
> for a good while still.
>
> On Sat, Jul 23, 2016 at 3:50 PM, Koert Kuipers  wrote:
>
>> i care about signalling it in advance mostly. and given the performance
>> differences we do have some interest in pushing towards java 8
>>
>> On Jul 23, 2016 6:10 PM, "Mark Hamstra"  wrote:
>>
>> Why the push to remove Java 7 support as soon as possible (which is how I
>> read your "cluster admins plan to migrate by date X, so Spark should end
>> Java 7 support then, too")?  First, I don't think we should be removing
>> Java 7 support until some time after all or nearly all relevant clusters
>> are actually no longer running on Java 6, and that targeting removal of
>> support at our best guess about when admins are just *planning* to migrate
>> isn't a very good idea.  Second, I don't see the significant difficulty or
>> harm in continuing to support Java 7 for a while longer.
>>
>> On Sat, Jul 23, 2016 at 2:54 PM, Koert Kuipers  wrote:
>>
>>> dropping java 7 support was considered for spark 2.0.x but we decided
>>> against it.
>>>
>>> ideally dropping support for a java version should be communicated far
>>> in advance to facilitate the transition.
>>>
>>> is this the right time to make that decision and start communicating it
>>> (mailing list, jira, etc.)? perhaps for spark 2.1.x or spark 2.2.x?
>>>
>>> my general sense is that most cluster admins have plans to migrate to
>>> java 8 before end of year. so that could line up nicely with spark 2.2
>>>
>>>
>>
>>
>


Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2017-02-07 Thread Reynold Xin
BTW I created a JIRA ticket for tracking:
https://issues.apache.org/jira/browse/SPARK-19493

We of course shouldn't do anything until we achieve consensus.


On Tue, Feb 7, 2017 at 3:47 PM, Reynold Xin  wrote:

> Bumping this.
>
> Given that we see occasional build breaks with Java 8, we should
> reconsider this and do it for 2.2 or 2.3. By the time 2.2 is released, it
> will have been almost a year since this thread started.
>
>
> On Sun, Jul 24, 2016 at 12:59 AM, Mark Hamstra 
> wrote:
>
>> Sure, signalling well ahead of time is good, as is getting better
>> performance from Java 8; but do either of those interests really require
>> dropping Java 7 support sooner rather than later?
>>
>> Now, to retroactively copy edit myself, when I previously wrote "after
>> all or nearly all relevant clusters are actually no longer running on Java
>> 6", I meant "...no longer running on Java 7".  We should be at a point now
>> where there aren't many Java 6 clusters left, but my sense is that there
>> are still quite a number of Java 7 clusters around, and that there will be
>> for a good while still.
>>
>> On Sat, Jul 23, 2016 at 3:50 PM, Koert Kuipers  wrote:
>>
>>> i care about signalling it in advance mostly. and given the performance
>>> differences we do have some interest in pushing towards java 8
>>>
>>> On Jul 23, 2016 6:10 PM, "Mark Hamstra"  wrote:
>>>
>>> Why the push to remove Java 7 support as soon as possible (which is how
>>> I read your "cluster admins plan to migrate by date X, so Spark should end
>>> Java 7 support then, too")?  First, I don't think we should be removing
>>> Java 7 support until some time after all or nearly all relevant clusters
>>> are actually no longer running on Java 6, and that targeting removal of
>>> support at our best guess about when admins are just *planning* to migrate
>>> isn't a very good idea.  Second, I don't see the significant difficulty or
>>> harm in continuing to support Java 7 for a while longer.
>>>
>>> On Sat, Jul 23, 2016 at 2:54 PM, Koert Kuipers 
>>> wrote:
>>>
 dropping java 7 support was considered for spark 2.0.x but we decided
 against it.

 ideally dropping support for a java version should be communicated far
 in advance to facilitate the transition.

 is this the right time to make that decision and start communicating it
 (mailing list, jira, etc.)? perhaps for spark 2.1.x or spark 2.2.x?

 my general sense is that most cluster admins have plans to migrate to
 java 8 before end of year. so that could line up nicely with spark 2.2


>>>
>>>
>>
>


Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-07 Thread StanZhai
From the thread dump page for an executor in the Web UI, I found about 1,300
threads named "DataStreamer for file
/test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
in the TIMED_WAITING state (screenshot not reproduced in this archive).

The excess off-heap memory may be caused by these leaked threads.
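One way to confirm the leak over time is to count the live DataStreamer threads from
inside the executor JVM (a diagnostic sketch only; run it from a task or an attached
debugger):

import scala.collection.JavaConverters._

// Counts threads whose name matches the leaked "DataStreamer for file ..." pattern.
def leakedDataStreamerThreads(): Int =
  Thread.getAllStackTraces.keySet.asScala
    .count(_.getName.startsWith("DataStreamer for file"))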

This problem occurs only when writing data to Hadoop (tasks may be killed
by the executor during the write).

Could this be related to https://issues.apache.org/jira/browse/HDFS-9812?

It may be a bug in Spark's handling of killed tasks during data writes. What's
the difference between Spark 1.6.x and 2.1.0 in how tasks are killed?

This is a critical issue, I've worked on this for days.

Any help?







Re: PSA: Java 8 unidoc build

2017-02-07 Thread Reynold Xin
I don't know if this would help but I think we can also officially stop
supporting Java 7 ...


On Tue, Feb 7, 2017 at 1:06 PM, Sean Owen  wrote:

> I believe that if we ran the Jenkins builds with Java 8 we would catch
> these? this doesn't require dropping Java 7 support or anything.
>
> @joshrosen I know we are just now talking about modifying the Jenkins jobs
> to remove old Hadoop configs. Is it possible to change the master jobs to
> use Java 8? can't hurt really in any event.
>
> Or maybe I'm mistaken and they already run Java 8 and it doesn't catch
> this until Java 8 is the target.
>
> Yeah this is going to keep breaking as javadoc 8 is pretty strict. Thanks
> Hyukjin. It has forced us to clean up a lot of sloppy bits of doc though.
>
>
> On Tue, Feb 7, 2017 at 12:13 AM Joseph Bradley 
> wrote:
>
>> Public service announcement: Our doc build has worked with Java 8 for
>> brief time periods, but new changes keep breaking the Java 8 unidoc build.
>> Please be aware of this, and try to test doc changes with Java 8!  In
>> general, it is stricter than Java 7 for docs.
>>
>> A shout out to @HyukjinKwon and others who have made many fixes for
>> this!  See these sample PRs for some issues causing failures (especially
>> around links):
>> https://github.com/apache/spark/pull/16741
>> https://github.com/apache/spark/pull/16604
>>
>> Thanks,
>> Joseph
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] 
>>
>


Re: PSA: Java 8 unidoc build

2017-02-07 Thread Sean Owen
I believe that if we ran the Jenkins builds with Java 8 we would catch
these? this doesn't require dropping Java 7 support or anything.

@joshrosen I know we are just now talking about modifying the Jenkins jobs
to remove old Hadoop configs. Is it possible to change the master jobs to
use Java 8? can't hurt really in any event.

Or maybe I'm mistaken and they already run Java 8 and it doesn't catch this
until Java 8 is the target.

Yeah this is going to keep breaking as javadoc 8 is pretty strict. Thanks
Hyukjin. It has forced us to clean up a lot of sloppy bits of doc though.

On Tue, Feb 7, 2017 at 12:13 AM Joseph Bradley 
wrote:

> Public service announcement: Our doc build has worked with Java 8 for
> brief time periods, but new changes keep breaking the Java 8 unidoc build.
> Please be aware of this, and try to test doc changes with Java 8!  In
> general, it is stricter than Java 7 for docs.
>
> A shout out to @HyukjinKwon and others who have made many fixes for this!
> See these sample PRs for some issues causing failures (especially around
> links):
> https://github.com/apache/spark/pull/16741
> https://github.com/apache/spark/pull/16604
>
> Thanks,
> Joseph
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>


Re: Java 9

2017-02-07 Thread Timur Shenkao
If I'm not wrong, they got rid of *sun.misc.Unsafe* in Java 9.

This class is still used by several libraries & frameworks.

http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
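For context, the reflective pattern described in the linked post, which many libraries
(Spark's own unsafe utilities among them) rely on, looks roughly like this; it is exactly
this kind of internal access that a Java 9 upgrade puts at risk:

// Obtain the Unsafe singleton the way most libraries do: via the private static
// "theUnsafe" field, since Unsafe.getUnsafe() rejects callers outside the JDK.
val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]

// Example use: report the platform pointer size (4 or 8 bytes).
println(s"addressSize = ${unsafe.addressSize()}")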

On Tue, Feb 7, 2017 at 12:51 PM, Pete Robbins  wrote:

> Yes, I agree but it may be worthwhile starting to look at this. I was just
> trying a build and it trips over some of the now defunct/inaccessible
> sun.misc classes.
>
> I was just interested in hearing if anyone has already gone through this
> to save me duplicating effort.
>
> Cheers,
>
> On Tue, 7 Feb 2017 at 11:46 Sean Owen  wrote:
>
>> I don't think anyone's tried it. I think we'd first have to agree to drop
>> Java 7 support before that could be seriously considered. The 8-9
>> difference is a bit more of a breaking change.
>>
>> On Tue, Feb 7, 2017 at 11:44 AM Pete Robbins  wrote:
>>
>> Is anyone working on support for running Spark on Java 9? Is this in a
>> roadmap anywhere?
>>
>>
>> Cheers,
>>
>>


Re: Java 9

2017-02-07 Thread Pete Robbins
Yes, I agree but it may be worthwhile starting to look at this. I was just
trying a build and it trips over some of the now defunct/inaccessible
sun.misc classes.

I was just interested in hearing if anyone has already gone through this to
save me duplicating effort.

Cheers,

On Tue, 7 Feb 2017 at 11:46 Sean Owen  wrote:

> I don't think anyone's tried it. I think we'd first have to agree to drop
> Java 7 support before that could be seriously considered. The 8-9
> difference is a bit more of a breaking change.
>
> On Tue, Feb 7, 2017 at 11:44 AM Pete Robbins  wrote:
>
> Is anyone working on support for running Spark on Java 9? Is this in a
> roadmap anywhere?
>
>
> Cheers,
>
>


Re: Java 9

2017-02-07 Thread Sean Owen
I don't think anyone's tried it. I think we'd first have to agree to drop
Java 7 support before that could be seriously considered. The 8-9
difference is a bit more of a breaking change.

On Tue, Feb 7, 2017 at 11:44 AM Pete Robbins  wrote:

> Is anyone working on support for running Spark on Java 9? Is this in a
> roadmap anywhere?
>
>
> Cheers,
>


Java 9

2017-02-07 Thread Pete Robbins
Is anyone working on support for running Spark on Java 9? Is this in a
roadmap anywhere?


Cheers,