[
https://issues.apache.org/jira/browse/SPARK-31276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jim Huang updated SPARK-31276:
------------------------------
Description:
The Spark SQL Guide --> Data Sources --> Generic Load/Save Functions page
[https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]
describes only a very simple load of an example file from the local file system.
I am looking for an example that demonstrates a workflow exercising different
file systems, for example (a sketch of this workflow follows the list):
# The driver loads an input file from its local file system.
# Add a simple column using lit() and store that DataFrame to HDFS in cluster
mode.
# Write a small, limited subset of that DataFrame back to the driver's local
file system. (Writing the complete dataset back would be an anti-pattern and is
out of scope for this example; the small DataFrame would hold some basic
statistics, not the actual complete dataset.)
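To make the request concrete, here is a minimal sketch of the kind of example I
am hoping for, assuming Spark 2.4.x submitted in YARN client mode so the driver
runs on the gateway host. All paths, the column name, and the summary
computation are hypothetical placeholders, not something taken from the Spark
docs:
{code:python}
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("uri-prefix-example").getOrCreate()

# 1. Load an input file via an explicit file:// URI. On a YARN cluster the
#    file must be readable at this path from the executor nodes as well
#    (e.g. a shared mount); the explicit scheme keeps the read off the
#    default file system (HDFS).
df = spark.read.option("header", "true").csv("file:///tmp/input/example.csv")

# 2. Add a simple literal column and persist the full DataFrame to HDFS,
#    again spelling out the scheme instead of relying on fs.defaultFS.
df_tagged = df.withColumn("batch_id", lit("batch-001"))
df_tagged.write.mode("overwrite").parquet("hdfs:///user/someuser/example_tagged")

# 3. Bring only a small summary back to the driver. Writing a DataFrame to a
#    file:// path would land on the executors' local disks, so the small
#    result is collected to the driver and written out with plain Python.
summary_rows = df_tagged.groupBy("batch_id").count().collect()
os.makedirs("/tmp/output", exist_ok=True)
with open("/tmp/output/summary.csv", "w") as f:
    f.write("batch_id,count\n")
    for row in summary_rows:
        f.write("{},{}\n".format(row["batch_id"], row["count"]))
{code}
The part I am least sure about is step 3; if there is a more idiomatic API for
landing a small result on the driver's local file system, that is exactly the
kind of guidance I would like the documentation to show.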
The examples I found on the internet only use simple paths without explicit URI
prefixes.
Without an explicit URI prefix, a file path inherits the default file system
from how Spark was launched (local standalone vs. YARN client mode), so the
same path is read from and written to the local file system in one case and to
HDFS in the other.
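For instance (hypothetical paths; this is my understanding of how an
unqualified path is resolved against fs.defaultFS):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With no scheme, the path is resolved against fs.defaultFS:
# file:/// when running locally, typically hdfs:/// on a YARN cluster.
df_default = spark.read.csv("/data/example.csv")

# An explicit scheme pins the file system regardless of how Spark was launched.
df_local = spark.read.csv("file:///data/example.csv")  # always the local file system
df_hdfs = spark.read.csv("hdfs:///data/example.csv")   # always HDFS (default namenode)
{code}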
There are situations where a Spark application needs to deal with both the
local file system and HDFS in YARN client mode (big data) within the same
application, such as producing a summary table at the end and storing it on the
driver's local file system.
If there is existing Spark documentation with examples that walk through the
different URI schemes in Spark YARN client mode, or a better or smarter Spark
pattern or API that is more suited to this, I am happy to accept that as well.
Thanks!
was:
The Spark SQL Guide --> Data Sources --> Generic Load/Save Functions page
[https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]
describes only a very simple load of an example file from the local file system.
I am looking for an example that demonstrates a workflow exercising different
file systems, for example:
# The driver loads an input file from its local file system.
# Add a simple column using lit() and store that DataFrame to HDFS in cluster
mode.
# Write that same final DataFrame back to the driver's local file system.
The examples I found on the internet only use simple paths without explicit URI
prefixes.
Without an explicit URI prefix, a file path inherits the default file system
from how Spark was launched (local standalone vs. cluster mode), so the same
path is read from and written to the local file system in one case and to HDFS
in the other.
There are situations where a Spark application needs to deal with both the
local file system and cluster mode (big data) within the same application, such
as producing a summary table at the end and storing it on the driver's local
file system.
If there is existing Spark documentation that provides examples of the
different URIs, I am happy to accept that as well. Thanks!
> Contrived working example that works with multiple URI file storages for
> Spark cluster mode
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-31276
> URL: https://issues.apache.org/jira/browse/SPARK-31276
> Project: Spark
> Issue Type: Wish
> Components: Examples
> Affects Versions: 2.4.5
> Reporter: Jim Huang
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]