Re: pyspark DataFrameWriter ignores customized settings?

2018-03-16 Thread chhsiao1981
Hi all,

Found the answer from the following link:

https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html

I can successfully set up the parquet block size with
spark.hadoop.parquet.block.size.

The following is the sample code:

# init
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

block_size = 512 * 1024

conf = (SparkConf()
        .setAppName("myapp")
        .setMaster("spark://spark1:7077")
        .set('spark.cores.max', 20)
        .set("spark.executor.cores", 10)
        .set("spark.executor.memory", "10g")
        .set('spark.hadoop.parquet.block.size', str(block_size))
        .set("spark.hadoop.dfs.blocksize", str(block_size))
        .set("spark.hadoop.dfs.block.size", str(block_size))
        .set("spark.hadoop.dfs.namenode.fs-limits.min-block-size", str(131072)))

sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"},
                                {'temp': "!"}])

# save using DataFrameWriter, resulting in a 512k block size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
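
For cluster-wide defaults, the same spark.hadoop.* passthrough settings can also go into conf/spark-defaults.conf instead of being set on SparkConf per application. A sketch (524288 = 512 * 1024; whether each dfs.* key is honored depends on the Hadoop version in use):

```
spark.hadoop.parquet.block.size                      524288
spark.hadoop.dfs.blocksize                           524288
spark.hadoop.dfs.block.size                          524288
spark.hadoop.dfs.namenode.fs-limits.min-block-size   131072
```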





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
OK, will do.

On Fri, Mar 16, 2018 at 4:41 PM Sean Owen  wrote:

> I think you can file a JIRA and open a PR. All of the bits that use "gpg
> ... SHA512 file ..." can use shasum instead.
> I would not change any existing release artifacts though.
>
> On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I have sha512sum on my Mac via Homebrew, but yeah as long as the format
>> is the same I suppose it doesn’t matter if we use shasum -a or sha512sum.
>>
>> So shall I file a JIRA + PR for this? Or should I leave the PR to a
>> maintainer? And are we OK with updating all the existing release hashes to
>> use the new format, or do we only want to do this for new releases?
>> ​
>>
>> On Fri, Mar 16, 2018 at 1:50 PM Felix Cheung 
>> wrote:
>>
>>> +1 there
>>>
>>> --
>>> *From:* Sean Owen 
>>> *Sent:* Friday, March 16, 2018 9:51:49 AM
>>> *To:* Felix Cheung
>>> *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list
>>>
>>> *Subject:* Re: Changing how we compute release hashes
>>> I think the issue with that is that OS X doesn't have "sha512sum". Both
>>> it and Linux have "shasum -a 512" though.
>>>
>>> On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung 
>>> wrote:
>>>
 Instead of using gpg to create the sha512 hash file we could just
 change to using sha512sum? That would output the right format that is in
 turn verifiable.


 --
 *From:* Ryan Blue 
 *Sent:* Friday, March 16, 2018 8:31:45 AM
 *To:* Nicholas Chammas
 *Cc:* Spark dev list
 *Subject:* Re: Changing how we compute release hashes

 +1 It's possible to produce the same file with gpg, but the sha*sum
 utilities are a bit easier to remember the syntax for.

 On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> To verify that I’ve downloaded a Hadoop release correctly, I can just
> do this:
>
> $ shasum --check hadoop-2.7.5.tar.gz.sha256
> hadoop-2.7.5.tar.gz: OK
>
> However, since we generate Spark release hashes with GPG
> ,
> the resulting hash is in a format that doesn’t play well with any tools:
>
> $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
> shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
> checksum lines found
>
> GPG doesn’t seem to offer a way to verify a file from a hash.
>
> I know I can always manipulate the SHA512 hash into a different format
> or just manually inspect it, but as a “quality of life” improvement can we
> change how we generate the SHA512 hash so that it plays nicely with
> shasum? If it’s too disruptive to change the format of the SHA512
> hash, can we add a SHA256 hash to our releases in this format?
>
> I suppose if it’s not easy to update or add hashes to our existing
> releases, it may be too difficult to change anything here. But I’m not
> sure, so I thought I’d ask.
>
> Nick
> ​
>



 --
 Ryan Blue
 Software Engineer
 Netflix

>>>


Re: Changing how we compute release hashes

2018-03-16 Thread Sean Owen
I think you can file a JIRA and open a PR. All of the bits that use "gpg
... SHA512 file ..." can use shasum instead.
I would not change any existing release artifacts though.

On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas 
wrote:

> I have sha512sum on my Mac via Homebrew, but yeah as long as the format
> is the same I suppose it doesn’t matter if we use shasum -a or sha512sum.
>
> So shall I file a JIRA + PR for this? Or should I leave the PR to a
> maintainer? And are we OK with updating all the existing release hashes to
> use the new format, or do we only want to do this for new releases?
> ​
>
> On Fri, Mar 16, 2018 at 1:50 PM Felix Cheung 
> wrote:
>
>> +1 there
>>
>> --
>> *From:* Sean Owen 
>> *Sent:* Friday, March 16, 2018 9:51:49 AM
>> *To:* Felix Cheung
>> *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list
>>
>> *Subject:* Re: Changing how we compute release hashes
>> I think the issue with that is that OS X doesn't have "sha512sum". Both
>> it and Linux have "shasum -a 512" though.
>>
>> On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung 
>> wrote:
>>
>>> Instead of using gpg to create the sha512 hash file we could just change
>>> to using sha512sum? That would output the right format that is in turn
>>> verifiable.
>>>
>>>
>>> --
>>> *From:* Ryan Blue 
>>> *Sent:* Friday, March 16, 2018 8:31:45 AM
>>> *To:* Nicholas Chammas
>>> *Cc:* Spark dev list
>>> *Subject:* Re: Changing how we compute release hashes
>>>
>>> +1 It's possible to produce the same file with gpg, but the sha*sum
>>> utilities are a bit easier to remember the syntax for.
>>>
>>> On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 To verify that I’ve downloaded a Hadoop release correctly, I can just
 do this:

 $ shasum --check hadoop-2.7.5.tar.gz.sha256
 hadoop-2.7.5.tar.gz: OK

 However, since we generate Spark release hashes with GPG
 ,
 the resulting hash is in a format that doesn’t play well with any tools:

 $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
 shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
 checksum lines found

 GPG doesn’t seem to offer a way to verify a file from a hash.

 I know I can always manipulate the SHA512 hash into a different format
 or just manually inspect it, but as a “quality of life” improvement can we
 change how we generate the SHA512 hash so that it plays nicely with
 shasum? If it’s too disruptive to change the format of the SHA512
 hash, can we add a SHA256 hash to our releases in this format?

 I suppose if it’s not easy to update or add hashes to our existing
 releases, it may be too difficult to change anything here. But I’m not
 sure, so I thought I’d ask.

 Nick
 ​

>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>


Re: Live Stream Code Review today (in like ~5 minutes)

2018-03-16 Thread Holden Karau
Ok and the recording is now being processed and will be posted at the same
URL once it's done (https://www.youtube.com/watch?v=pXzVtEUjrLc). You can
also see a walk through with Cody merging his first PR (
https://www.youtube.com/watch?v=_SdNu7MezL4 ).

Since I had a slight problem during the live review this morning and only
had time to look at one I'm going to continue and do another one this
afternoon at 4pm pacific at https://www.youtube.com/watch?v=4kUJrhRFoJg.

On Fri, Mar 16, 2018 at 10:58 AM, Holden Karau  wrote:

> I'm going to be doing another live stream code review today in ~5 minutes.
> You can join and watch at https://www.youtube.com/watch?v=pXzVtEUjrLc and the
> result will be posted as well. In this review I'll look at PRs in both the
> Spark project and a related project, spark-testing-base.
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
I have sha512sum on my Mac via Homebrew, but yeah as long as the format is
the same I suppose it doesn’t matter if we use shasum -a or sha512sum.

So shall I file a JIRA + PR for this? Or should I leave the PR to a
maintainer? And are we OK with updating all the existing release hashes to
use the new format, or do we only want to do this for new releases?
​

On Fri, Mar 16, 2018 at 1:50 PM Felix Cheung 
wrote:

> +1 there
>
> --
> *From:* Sean Owen 
> *Sent:* Friday, March 16, 2018 9:51:49 AM
> *To:* Felix Cheung
> *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list
>
> *Subject:* Re: Changing how we compute release hashes
> I think the issue with that is that OS X doesn't have "sha512sum". Both it
> and Linux have "shasum -a 512" though.
>
> On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung 
> wrote:
>
>> Instead of using gpg to create the sha512 hash file we could just change
>> to using sha512sum? That would output the right format that is in turn
>> verifiable.
>>
>>
>> --
>> *From:* Ryan Blue 
>> *Sent:* Friday, March 16, 2018 8:31:45 AM
>> *To:* Nicholas Chammas
>> *Cc:* Spark dev list
>> *Subject:* Re: Changing how we compute release hashes
>>
>> +1 It's possible to produce the same file with gpg, but the sha*sum
>> utilities are a bit easier to remember the syntax for.
>>
>> On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> To verify that I’ve downloaded a Hadoop release correctly, I can just do
>>> this:
>>>
>>> $ shasum --check hadoop-2.7.5.tar.gz.sha256
>>> hadoop-2.7.5.tar.gz: OK
>>>
>>> However, since we generate Spark release hashes with GPG
>>> ,
>>> the resulting hash is in a format that doesn’t play well with any tools:
>>>
>>> $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
>>> shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
>>> checksum lines found
>>>
>>> GPG doesn’t seem to offer a way to verify a file from a hash.
>>>
>>> I know I can always manipulate the SHA512 hash into a different format
>>> or just manually inspect it, but as a “quality of life” improvement can we
>>> change how we generate the SHA512 hash so that it plays nicely with
>>> shasum? If it’s too disruptive to change the format of the SHA512 hash,
>>> can we add a SHA256 hash to our releases in this format?
>>>
>>> I suppose if it’s not easy to update or add hashes to our existing
>>> releases, it may be too difficult to change anything here. But I’m not
>>> sure, so I thought I’d ask.
>>>
>>> Nick
>>> ​
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Re: pyspark DataFrameWriter ignores customized settings?

2018-03-16 Thread chhsiao1981
Hi all,

Looks like it's a parquet-specific issue.

I can successfully write with a 512k block size
if I use df.write.csv() or df.write.text().
(The csv write succeeds once I put hadoop-lzo-0.4.15-cdh5.13.0.jar
into the jars dir.)

sample code:


from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from hdfs import InsecureClient  # from the python `hdfs` package

block_size = 512 * 1024

conf = (SparkConf()
        .setAppName("myapp")
        .setMaster("spark://spark1:7077")
        .set('spark.cores.max', 20)
        .set("spark.executor.cores", 10)
        .set("spark.executor.memory", "10g")
        .set("spark.hadoop.dfs.blocksize", str(block_size))
        .set("spark.hadoop.dfs.block.size", str(block_size))
        .set("spark.hadoop.dfs.namenode.fs-limits.min-block-size", str(131072)))

sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"},
                                {'temp': "!"}])

# save using DataFrameWriter, resulting in a 128MB block size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')

# save using DataFrameWriter.csv, resulting in a 512k block size
df_txt.write.mode('overwrite').csv('hdfs://spark1/tmp/temp_with_df_csv')

# save using DataFrameWriter.text, resulting in a 512k block size
df_txt.write.mode('overwrite').text('hdfs://spark1/tmp/temp_with_df_text')

# save using rdd, resulting in a 512k block size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')






Live Stream Code Review today (in like ~5 minutes)

2018-03-16 Thread Holden Karau
I'm going to be doing another live stream code review today in ~5 minutes.
You can join and watch at https://www.youtube.com/watch?v=pXzVtEUjrLc and the
result will be posted as well. In this review I'll look at PRs in both the
Spark project and a related project, spark-testing-base.

-- 
Twitter: https://twitter.com/holdenkarau


Re: Changing how we compute release hashes

2018-03-16 Thread Felix Cheung
+1 there


From: Sean Owen 
Sent: Friday, March 16, 2018 9:51:49 AM
To: Felix Cheung
Cc: rb...@netflix.com; Nicholas Chammas; Spark dev list
Subject: Re: Changing how we compute release hashes

I think the issue with that is that OS X doesn't have "sha512sum". Both it and 
Linux have "shasum -a 512" though.

On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung 
> wrote:
Instead of using gpg to create the sha512 hash file we could just change to 
using sha512sum? That would output the right format that is in turn verifiable.



From: Ryan Blue 
Sent: Friday, March 16, 2018 8:31:45 AM
To: Nicholas Chammas
Cc: Spark dev list
Subject: Re: Changing how we compute release hashes

+1 It's possible to produce the same file with gpg, but the sha*sum utilities 
are a bit easier to remember the syntax for.

On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas 
> wrote:

To verify that I’ve downloaded a Hadoop release correctly, I can just do this:

$ shasum --check hadoop-2.7.5.tar.gz.sha256
hadoop-2.7.5.tar.gz: OK


However, since we generate Spark release hashes with 
GPG,
 the resulting hash is in a format that doesn’t play well with any tools:

$ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
checksum lines found


GPG doesn’t seem to offer a way to verify a file from a hash.

I know I can always manipulate the SHA512 hash into a different format or just 
manually inspect it, but as a “quality of life” improvement can we change how 
we generate the SHA512 hash so that it plays nicely with shasum? If it’s too 
disruptive to change the format of the SHA512 hash, can we add a SHA256 hash to 
our releases in this format?

I suppose if it’s not easy to update or add hashes to our existing releases, it 
may be too difficult to change anything here. But I’m not sure, so I thought 
I’d ask.

Nick

​



--
Ryan Blue
Software Engineer
Netflix


Re: Changing how we compute release hashes

2018-03-16 Thread Sean Owen
I think the issue with that is that OS X doesn't have "sha512sum". Both it
and Linux have "shasum -a 512" though.

On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung 
wrote:

> Instead of using gpg to create the sha512 hash file we could just change
> to using sha512sum? That would output the right format that is in turn
> verifiable.
>
>
> --
> *From:* Ryan Blue 
> *Sent:* Friday, March 16, 2018 8:31:45 AM
> *To:* Nicholas Chammas
> *Cc:* Spark dev list
> *Subject:* Re: Changing how we compute release hashes
>
> +1 It's possible to produce the same file with gpg, but the sha*sum
> utilities are a bit easier to remember the syntax for.
>
> On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> To verify that I’ve downloaded a Hadoop release correctly, I can just do
>> this:
>>
>> $ shasum --check hadoop-2.7.5.tar.gz.sha256
>> hadoop-2.7.5.tar.gz: OK
>>
>> However, since we generate Spark release hashes with GPG
>> ,
>> the resulting hash is in a format that doesn’t play well with any tools:
>>
>> $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
>> shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
>> checksum lines found
>>
>> GPG doesn’t seem to offer a way to verify a file from a hash.
>>
>> I know I can always manipulate the SHA512 hash into a different format or
>> just manually inspect it, but as a “quality of life” improvement can we
>> change how we generate the SHA512 hash so that it plays nicely with
>> shasum? If it’s too disruptive to change the format of the SHA512 hash,
>> can we add a SHA256 hash to our releases in this format?
>>
>> I suppose if it’s not easy to update or add hashes to our existing
>> releases, it may be too difficult to change anything here. But I’m not
>> sure, so I thought I’d ask.
>>
>> Nick
>> ​
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Changing how we compute release hashes

2018-03-16 Thread Felix Cheung
Instead of using gpg to create the sha512 hash file we could just change to 
using sha512sum? That would output the right format that is in turn verifiable.



From: Ryan Blue 
Sent: Friday, March 16, 2018 8:31:45 AM
To: Nicholas Chammas
Cc: Spark dev list
Subject: Re: Changing how we compute release hashes

+1 It's possible to produce the same file with gpg, but the sha*sum utilities 
are a bit easier to remember the syntax for.

On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas 
> wrote:

To verify that I’ve downloaded a Hadoop release correctly, I can just do this:

$ shasum --check hadoop-2.7.5.tar.gz.sha256
hadoop-2.7.5.tar.gz: OK


However, since we generate Spark release hashes with 
GPG,
 the resulting hash is in a format that doesn’t play well with any tools:

$ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
checksum lines found


GPG doesn’t seem to offer a way to verify a file from a hash.

I know I can always manipulate the SHA512 hash into a different format or just 
manually inspect it, but as a “quality of life” improvement can we change how 
we generate the SHA512 hash so that it plays nicely with shasum? If it’s too 
disruptive to change the format of the SHA512 hash, can we add a SHA256 hash to 
our releases in this format?

I suppose if it’s not easy to update or add hashes to our existing releases, it 
may be too difficult to change anything here. But I’m not sure, so I thought 
I’d ask.

Nick

​



--
Ryan Blue
Software Engineer
Netflix


Re: Changing how we compute release hashes

2018-03-16 Thread Ryan Blue
+1 It's possible to produce the same file with gpg, but the sha*sum
utilities are a bit easier to remember the syntax for.

On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> To verify that I’ve downloaded a Hadoop release correctly, I can just do
> this:
>
> $ shasum --check hadoop-2.7.5.tar.gz.sha256
> hadoop-2.7.5.tar.gz: OK
>
> However, since we generate Spark release hashes with GPG
> ,
> the resulting hash is in a format that doesn’t play well with any tools:
>
> $ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
> shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
> checksum lines found
>
> GPG doesn’t seem to offer a way to verify a file from a hash.
>
> I know I can always manipulate the SHA512 hash into a different format or
> just manually inspect it, but as a “quality of life” improvement can we
> change how we generate the SHA512 hash so that it plays nicely with shasum?
> If it’s too disruptive to change the format of the SHA512 hash, can we add
> a SHA256 hash to our releases in this format?
>
> I suppose if it’s not easy to update or add hashes to our existing
> releases, it may be too difficult to change anything here. But I’m not
> sure, so I thought I’d ask.
>
> Nick
> ​
>



-- 
Ryan Blue
Software Engineer
Netflix