get method guid prefix for file parts for write

2020-09-24 Thread gpongracz
I lack the vocabulary for this question, so please bear with my description of
the problem...

I am searching for a way to get the GUID prefix value that is used when
writing out the parts of a file.

e.g.:

part-0-b5265e7b-b974-4083-a66e-e7698258ca50-c000.csv

I would like to get the prefix "0-b5265e7b-b974-4083-a66e-e7698258ca50"

Is there a way that I might be able to access this value programmatically?

Any assistance is appreciated.

George Pongracz







Re: Running K8s integration tests for changes in core?

2020-09-24 Thread Hyukjin Kwon
+1

On Fri, 25 Sep 2020, 02:21 Holden Karau,  wrote:

> Thanks Shane!
>

Re: Running K8s integration tests for changes in core?

2020-09-24 Thread Holden Karau
Thanks Shane!


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Running K8s integration tests for changes in core?

2020-09-24 Thread shane knapp ☠
just revisiting this thread...

re presubmit strategy:  i don't think this would be easy to set up...  and
i'm not sure what benefit it will give us.

re inadvertent errors:  since we're checking out the same hash from the PR
for both builds, and they'll run simultaneously, i don't think it'll be an
issue.

re overloading the workers:  nah.  the regular PRB takes ~4hr, and the k8s
PRB takes ~30m and runs in parallel.

i'll set this up right now and keep an eye on the queue/build results today.

shane

On Thu, Aug 20, 2020 at 2:28 PM Holden Karau  wrote:

> Sounds good, thanks for the heads up. I hope you get some time to relax :)
>
> On Thu, Aug 20, 2020 at 2:26 PM shane knapp ☠  wrote:
>
>> fyi, i won't be making this change until the 1st week of september.  i'll
>> be out, off the grid all next week!  :)
>>
>> i will send an announcement out tomorrow on how to contact my team here @
>> uc berkeley if jenkins goes down.
>>
>> shane
>>
>> On Thu, Aug 20, 2020 at 4:40 AM Prashant Sharma 
>> wrote:
>>
>>> Another option: we could have something like a "presubmit" PR build. In
>>> other words, running the entire 4 h build plus the K8s integration tests
>>> on each pushed commit is too much at once, and there is a chance that one
>>> thing can inadvertently affect other components (as you just said).
>>>
>>> A presubmit build (which includes the K8s integration tests) would run
>>> once the PR receives an LGTM from "Approved reviewers". That is one
>>> criterion that comes to mind; others may have better suggestions.
>>>
>>> On Thu, Aug 20, 2020 at 12:25 AM shane knapp ☠ 
>>> wrote:
>>>
 we'll be gated by the number of ubuntu workers w/minikube and docker,
 but it shouldn't be too bad, as the full integration test takes ~45m vs
 4+ hrs for the regular PRB.

 i can enable this in about 1m if the consensus is that we want this.

 On Wed, Aug 19, 2020 at 11:37 AM Holden Karau 
 wrote:

> Sounds good. In the meantime, would folks committing things in core run
> the K8s PRB or run it locally? A second change that broke the K8s PR
> tests was committed this morning.
>
> On Tue, Aug 18, 2020 at 9:53 PM Prashant Sharma 
> wrote:
>
>> +1, we should enable.
>>
>> On Wed, Aug 19, 2020 at 9:18 AM Holden Karau 
>> wrote:
>>
>>> Hi Dev Folks,
>>>
>>> I was wondering how people feel about enabling the K8s PRB
>>> automatically for all core changes? Sometimes I forget that a change 
>>> might
>>> impact one of the K8s integration tests since a bunch of them look at 
>>> log
>>> messages. Would folks be OK with turning on the K8s integration PRB for 
>>> all
>>> core changes as well as K8s changes?
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] downtime due to SSL cert errors

2020-09-24 Thread shane knapp ☠
certs delivered and installed...  we're back!

On Wed, Sep 23, 2020 at 6:07 PM shane knapp ☠  wrote:

> jenkins is up and building, but not reachable via https at the moment.
> i'm working on getting this sorted ASAP.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Supporting Row and DataFrame level metadata?

2020-09-24 Thread Jeff Evans
Hi,

I'm wondering if there has been any past discussion on the subject of
supporting metadata attributes as a first-class concept, both at the row
level and at the DataFrame level? I did a Jira search, but most of the items
I found were unrelated to this concept or pertained to column-level metadata,
which is of course already supported.

Row-level metadata would be useful in scenarios like the following:

   - Lineage and provenance attributes, which need to eventually be
   propagated to some other system, but which shouldn't be written out with
   the "regular" DataFrame.
   - Other custom attributes, such as the input_file_name for data read
   from HDFS, or message keys for data read from Kafka.

So why not just store the attributes as regular columns (possibly with some
special prefix to help us filter them out if needed)? A few reasons, with a
sketch of the prefix approach after this list:

   - When passing the DataFrame to another piece of library code, we might
   need to remove those columns, depending on what it does (ex: if it operates
   on every column).  Or we might need to perform an extra join in order to
   "retain" the attributes from the rows processed by the library function.
   - If we need to union an existing DataFrame (with metadata) with another
   one read from a different source (which has different, or no, metadata),
   and the metadata attributes are represented as normal columns, we have to
   do some finagling to get the union to work properly.
   - If we want to simply write the DataFrame somewhere, we probably don't
   want to mix metadata attributes with the actual data.

For DataFrame-level metadata:

   - Attributes such as the table/schema/DB name, or primary key
   information, for DataFrames read from JDBC (ex: downstream processing might
   want to always partitionBy these key columns, whatever they happen to be)
   - Adding tracking information about what app-specific processing steps
   have been applied so far, their timings, etc.
   - For SQL sources, capturing the full query that produced the DataFrame

Some of these scenarios can be made easier by custom code with implicit
conversions, as outlined here. But that approach has its own drawbacks and
shortcomings (as outlined in the comments).

How are people currently managing this?  Does it make sense, conceptually,
as something that Spark should directly support?