[ 
https://issues.apache.org/jira/browse/SPARK-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-13766:
---------------------------------
    Description: 
Currently, the output (part-files) from the CSV, TEXT and JSON data sources does 
not have file extensions such as .csv, .txt and .json (except for compression 
extensions such as .gz, .deflate and .bz2).

In addition, it looks like Parquet part-files have extensions such as 
.gz.parquet or .snappy.parquet depending on the compression codec, whereas ORC 
part-files have no codec extension and are just .orc.

So, in a simple view, currently the extensions are set as below:

{code}
TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
Parquet -  [.COMPRESSION_CODEC_NAME].parquet
ORC - .orc
{code}

It would be great if we had consistent naming for them.

  was:
Currently, the output (part-files) from the CSV, TEXT and JSON data sources does 
not have file extensions such as .csv, .txt and .json (except for compression 
extensions such as .gz, .deflate and .bz2).

In addition, it looks like Parquet part-files have extensions such as 
.gz.parquet or .snappy.parquet depending on the compression codec, whereas ORC 
part-files have no codec extension and are just .orc.

So, in a simple view,

{code}
TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
Parquet -  [.COMPRESSION_CODEC_NAME].parquet
ORC - .orc
{code}

It would be great if we had consistent naming for them.


> Inconsistent file extensions and omitting file extensions written by CSV, 
> TEXT and JSON data sources
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13766
>                 URL: https://issues.apache.org/jira/browse/SPARK-13766
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> Currently, the output (part-files) from the CSV, TEXT and JSON data sources 
> does not have file extensions such as .csv, .txt and .json (except for 
> compression extensions such as .gz, .deflate and .bz2).
> In addition, it looks like Parquet part-files have extensions such as 
> .gz.parquet or .snappy.parquet depending on the compression codec, whereas 
> ORC part-files have no codec extension and are just .orc.
> So, in a simple view, currently the extensions are set as below:
> {code}
> TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
> Parquet -  [.COMPRESSION_CODEC_NAME].parquet
> ORC - .orc
> {code}
> It would be great if we had consistent naming for them.
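
The naming rules summarised in the issue can be sketched as a small function. 
This is only an illustration of the table above, not Spark's actual code: the 
function name current_extension and its parameters are hypothetical, and the 
codec is passed in as the bare extension text (for example "gz" or "snappy").

```python
from typing import Optional


def current_extension(fmt: str, codec_ext: Optional[str] = None) -> str:
    """Return the part-file extension as described in the issue.

    Hypothetical helper for illustration only; it encodes the three
    rules from the issue description and nothing else.
    """
    fmt = fmt.lower()
    # The codec part is optional, matching [.COMPRESSION_CODEC_NAME].
    codec = f".{codec_ext}" if codec_ext else ""
    if fmt in ("text", "csv", "json"):
        # No format extension at all, only the compression extension.
        return codec
    if fmt == "parquet":
        # Codec extension first, then the format extension.
        return f"{codec}.parquet"
    if fmt == "orc":
        # The codec is not reflected in the file name.
        return ".orc"
    raise ValueError(f"unknown format: {fmt}")
```

Calling current_extension("json", "gz") and current_extension("parquet", "gz") 
side by side shows the inconsistency: only the Parquet name still identifies 
the file format.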



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

