[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...

sujith71955 Fri, 14 Sep 2018 11:27:27 -0700

Github user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22396#discussion_r217802972
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
       - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
    +  - Since Spark 2.4 load command from local filesystem supports wildcards 
in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in 
Older versions space in folder/file names has been represented using '%20'(e.g. 
LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be 
supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space 
character in folder/file names (e.g. LOAD DATA INPATH 
'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. 
(e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv')
    --- End diff --
    
    @cloud-fan We follow the same syntax as older versions for the LOAD command 
path. The one exception is that older versions did not let the user provide 
wildcard characters at the folder level of a local file system path. The new 
implementation supports them, and the same syntax is supported on HDFS as well. 
All the usages I mentioned therefore apply to both local and HDFS file systems, 
which makes them more consistent than in older versions.
    
    For more details, please refer to the PR below, and let me know if anything 
needs clarification. Thanks
    https://github.com/apache/spark/pull/20611
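
As a rough illustration of the wildcard semantics discussed above, the sketch below uses Python's standard-library `fnmatch` module to show which paths a folder-level `*`, a single-character `?`, and a literal space would select. It is not Spark's or Hadoop's implementation, and their glob rules may differ in details. The helper `matches` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, path: str) -> bool:
    """Match segment by segment so '*' and '?' never cross a '/' boundary.

    Assumption: this only approximates the glob behaviour of the LOAD
    command; it is an illustration, not the real implementation.
    """
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatchcase(part, pat) for pat, part in zip(pattern_parts, path_parts)
    )


# Hypothetical paths, loosely modelled on the examples in the diff.
candidates = [
    "tmp/folder1/data.csv",
    "tmp/folder2/data.csv",
    "tmp/other/data.csv",
    "tmp/folderName/fileName1.csv",
    "tmp/folderName/fileName12.csv",
    "tmp/folderName/file Name.csv",
]

# '*' at the folder level selects every folder whose name starts with "folder".
folder_level = [p for p in candidates if matches("tmp/folder*/data.csv", p)]

# '?' stands for exactly one character in the file name.
single_char = [p for p in candidates if matches("tmp/folderName/fileName?.csv", p)]

# A plain space is matched literally; no '%20' escaping is involved.
literal_space = [p for p in candidates if matches("tmp/folderName/file Name.csv", p)]
```

Under these assumptions, `folder_level` holds the two `folder1`/`folder2` paths, `single_char` holds only `fileName1.csv`, and `literal_space` holds only the path containing the space.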


