[
https://issues.apache.org/jira/browse/DRILL-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arina Ielchiieva resolved DRILL-6842.
-------------------------------------
Resolution: Fixed
Fixed in the scope of DRILL-6096.
> Export to CSV using CREATE TABLE AS (CTAS) wrong parsed
> -------------------------------------------------------
>
> Key: DRILL-6842
> URL: https://issues.apache.org/jira/browse/DRILL-6842
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Text & CSV, Storage - Writer
> Affects Versions: 1.14.0
> Environment: - Tested with latest version *Apache Drill* 1.14.0, and
> building the latest version from master (Github repo), commit
> ad61c6bc1dd24994e50fe7dfed043d5e57dba8f9 at _Nov 5, 2018_.
> - *Linux* x64, Ubuntu 16.04
> - *OpenJDK* Runtime Environment (build
> 1.8.0_171-8u171-b11-0ubuntu0.17.10.1-b11)
> - Apache *Maven* 3.5.0
> Reporter: Mariano Ruiz
> Priority: Minor
> Labels: csv, export
> Fix For: 1.17.0
>
> Attachments: Screenshot from 2018-11-09 14-18-43.png
>
>
> When you export to a CSV using CTAS the result of a query, most of the time
> the generated file is OK, but if you have in the results text columns with
> "," characters, the resulting CSV file is broken, because does not enclose
> the cells with commas inside with the " character.
> Steps to reproduce the bug:
> Lets say you have the following table in some source of data, maybe a CSV
> file too:
> {code:title=/tmp/input.csv}
> product_ean,product_name,product_brand
> 12345678900,IPhone X,Apple
> 99999911100,"Samsung S9, Black",Samsung
> 11111223456,Smartwatch XY,Some Brand
> {code}
> Note that the second row of data, in the column "product_name", it has a
> value with a comma inside (_Samsung S9, Black_), so all the cell value is
> enclosed with " characters, while the rest of the column cells aren't,
> despite they could be enclosed too.
> So if you query this file, Drill will interpret correctly the file and does
> not interpret that comma inside the cell as a separator like the rest of the
> commas in the file:
> {code}
> 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/input.csv`;
> +--------------+--------------------+----------------+
> | product_ean | product_name | product_brand |
> +--------------+--------------------+----------------+
> | 12345678900 | IPhone X | Apple |
> | 99999911100 | Samsung S9, Black | Samsung |
> | 11111223456 | Smartwatch XY | Some Brand |
> +--------------+--------------------+----------------+
> 3 rows selected (1.874 seconds)
> {code}
> But now, if you want to query the file and export the result as CSV using the
> CTAS feature, using the following steps:
> {code}
> 0: jdbc:drill:zk=local> USE dfs.tmp;
> +-------+--------------------------------------+
> | ok | summary |
> +-------+--------------------------------------+
> | true | Default schema changed to [dfs.tmp] |
> +-------+--------------------------------------+
> 1 row selected (0.13 seconds)
> 0: jdbc:drill:zk=local> ALTER SESSION SET `store.format`='csv';
> +-------+------------------------+
> | ok | summary |
> +-------+------------------------+
> | true | store.format updated. |
> +-------+------------------------+
> 1 row selected (0.094 seconds)
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.my_output AS SELECT * FROM
> dfs.`/tmp/input.csv`;
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 3 |
> +-----------+----------------------------+
> 1 row selected (0.453 seconds)
> {code}
> The output file is this:
> {code:title=/tmp/my_output/0_0_0.csv}
> product_ean,product_name,product_brand
> 12345678900,IPhone X,Apple
> 99999911100,Samsung S9, Black,Samsung
> 11111223456,Smartwatch XY,Some Brand
> {code}
> The text _Samsung S9, Black_ in the cell is not quoted, so any CSV
> interpreter like an office tool, a Java/Python/... library will interpret it
> as two cell instead of one. Even Apache Drill will interpret it wrong:
> {code}
> 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/my_output/0_0_0.csv`;
> +--------------+----------------+----------------+
> | product_ean | product_name | product_brand |
> +--------------+----------------+----------------+
> | 12345678900 | IPhone X | Apple |
> | 99999911100 | Samsung S9 | Black |
> | 11111223456 | Smartwatch XY | Some Brand |
> +--------------+----------------+----------------+
> 3 rows selected (0.175 seconds)
> {code}
> Note that the ending part _ Black_ was interpreted as a following cell, and
> the real following cell is not showed, but it's not an error in the Drill
> interpreter, it's an error of how Drill exported the result that now in the
> last query was used as input.
> Here is how the file is interpreted by LibreOffice Calc:
> !Screenshot from 2018-11-09 14-18-43.png!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)