[jira] [Commented] (IOTDB-842) Better Import-CSV Tool

Xiangdong Huang (Jira) Thu, 12 Aug 2021 01:42:04 -0700


    [ 
https://issues.apache.org/jira/browse/IOTDB-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397931#comment-17397931
 ]


Xiangdong Huang commented on IOTDB-842:
---------------------------------------

New User Interface Design:


# CSV Tool

The CSV tool can help you import data in CSV format to IoTDB or export data 
from IoTDB to a CSV file.

## Usage of export-csv.sh

### Syntax

```shell
# Unix/OS X
> tools/export-csv.sh  -h <ip> -p <port> -u <username> -pw <password> -td 
> <directory> [-tf <time-format> -datatype <true/false> -q <query command> -s 
> <sql file>]

# Windows
> tools\export-csv.bat -h <ip> -p <port> -u <username> -pw <password> -td 
> <directory> [-tf <time-format> -datatype <true/false> -q <query command> -s 
> <sql file>]
```

Description:

* `-datatype`:
    - true (by default): print the data type of timesries in the head line of 
CSV file. i.e., `Time, root.sg1.d1.s1 (int), root.sg1.d1.s2(long)`.
    - false: only print the timeseries name in the head line of the CSV file. 
i.e., `Time, root.sg1.d1.s1 , root.sg1.d1.s2`
* `-q <query command>`:
    - specifying a query command that you want to execute
    - example: `select * from root limit 100`, or `select * from root limit 100 
align by device`
* `-s <sql file>`:
    - specifying a SQL file which can consist of more than one sql. If there 
are multiple SQLs in one SQL file, the SQLs should be separated by line breaks. 
And, for each SQL, a output CSV file will be generated.
* `-td <directory>`:
    - specifying  the directory that the data will be exported
* `-tf <time-format>`:
    - specifying a time format that you want. The time format have to obey [ISO 
8601](https://calendars.wikia.org/wiki/ISO_8601) standard. If you want to save 
the time as the timestamp, then setting `-tf timestamp`
    - example: `-tf yyyy-MM-dd\ HH:mm:ss` or `-tf timestamp`

More, if you don't use one of `-s` and `-q`, you need to enter some queries 
after running the export script. The results of the different query will be 
saved to different CSV files.

### example

```shell
# Unix/OS X
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./
# Or
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -tf 
> yyyy-MM-dd\ HH:mm:ss
# or
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -q "select * 
> from root"
# Or
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -s sql.txt
# Or
> tools/export-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -tf 
> yyyy-MM-dd\ HH:mm:ss -s sql.txt

# Windows
> tools/export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./
# Or
> tools/export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -tf 
> yyyy-MM-dd\ HH:mm:ss
# or
> tools/export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -q "select 
> * from root"
# Or
> tools/export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -s sql.txt
# Or
> tools/export-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -td ./ -tf 
> yyyy-MM-dd\ HH:mm:ss -s sql.txt
```

### Sample SQL file

```sql
select * from root;
select * from root align by device;
```

The result of `select * from root`

```sql
Time,root.ln.wf04.wt04.status (BOOLEAN),root.ln.wf03.wt03.hardware 
(TEXT),root.ln.wf02.wt02.status (BOOLEAN),root.ln.wf02.wt02.hardware 
(TEXT),root.ln.wf01.wt01.hardware (TEXT),root.ln.wf01.wt01.status (BOOLEAN)
1970-01-01T08:00:00.001+08:00,true,"v1",true,"v1",v1,true
1970-01-01T08:00:00.002+08:00,true,"v1",,,,true
```

The result of `select * from root align by device`

```sql
Time,Device,hardware (TEXT),status (BOOLEAN)
1970-01-01T08:00:00.001+08:00,root.ln.wf01.wt01,"v1",true
1970-01-01T08:00:00.002+08:00,root.ln.wf01.wt01,,true
1970-01-01T08:00:00.001+08:00,root.ln.wf02.wt02,"v1",true
1970-01-01T08:00:00.001+08:00,root.ln.wf03.wt03,"v1",
1970-01-01T08:00:00.002+08:00,root.ln.wf03.wt03,"v1",
1970-01-01T08:00:00.001+08:00,root.ln.wf04.wt04,,true
1970-01-01T08:00:00.002+08:00,root.ln.wf04.wt04,,true
```

The data of boolean type signed by `true` and `false` without double quotes. 
And the text data will be enclosed in double quotes.

### Note

Note that if fields exported by the export tool have the following special 
characters:

1. `,`: the field will be enclosed by `"`.
2. `"`: the field will be enclosed by `"` and the original characters `"` in 
the field will be replaced by `\"`.

## Usage of import-csv.sh

### Create metadata (optional)

```sql
SET STORAGE GROUP TO root.fit.d1;
SET STORAGE GROUP TO root.fit.d2;
SET STORAGE GROUP TO root.fit.p;
CREATE TIMESERIES root.fit.d1.s1 WITH DATATYPE=INT32,ENCODING=RLE;
CREATE TIMESERIES root.fit.d1.s2 WITH DATATYPE=TEXT,ENCODING=PLAIN;
CREATE TIMESERIES root.fit.d2.s1 WITH DATATYPE=INT32,ENCODING=RLE;
CREATE TIMESERIES root.fit.d2.s3 WITH DATATYPE=INT32,ENCODING=RLE;
CREATE TIMESERIES root.fit.p.s1 WITH DATATYPE=INT32,ENCODING=RLE;
```

IoTDB has the ability of type inference, so it is not necessary to create 
metadata before data import. However, we still recommend creating metadata 
before importing data using the CSV import tool, as this can avoid unnecessary 
type conversion errors.

### Sample CSV file to be imported

```sql
Time,root.fit.d1.s1,root.fit.d1.s2,root.fit.d2.s1,root.fit.d2.s3,root.fit.p.s1
1,100,"hello",200,300,400
2,500,"world",600,700,800
3,900,"hello, \"world\"",1000,1100,1200
```

### Syntax

```shell
# Unix/OS X
> tools/import-csv.sh -h <ip> -p <port> -u <username> -pw <password> -f 
> <xxx.csv> [-fd <yyy.csv> -policy <1/2> -table-aligned <time/device>]

# Windows
> tools\import-csv.bat -h <ip> -p <port> -u <username> -pw <password> -f 
> <xxx.csv> [-fd <yyy.csv> -policy <1/2> -table-aligned <time/device>]
```

Description:

* `-f`:
  - the CSV file that you want to import
  - example: `-f filename.csv`

 * `-fd`:
   - export0.csv.failed (by default): specifying a file to save failed lines
   - example: `-fd failed-filename.csv`
* `-policy`:
  - 1 (by default): you can directly use the this policy, if the CSV file has 
defined the data type of timeseries in the head of CSV file.  i.e., `Time, 
root.sg1.d1.s1 (int), root.sg1.d1.s2(long)`.
  - 2: make the program do type inference
* `-table-aligned`
  - `time` (by default): the data that you want to import like the result of 
`select * from root`
  - `device`: the data that you want to import like the result of `select * 
from root align by device`

### Example

```sh
# Unix/OS X
> tools/import-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -f 
> example-filename.csv -fd failed-filename.csv
# or
> tools/import-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -f 
> example-filename.csv -fd failed-filename.csv -policy 2
# or
> tools/import-csv.sh -h 127.0.0.1 -p 6667 -u root -pw root -f 
> example-filename.csv -fd failed-filename.csv -policy 2
-table-aligned device

# Windows
> tools\import-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -f 
> example-filename.csv
# or
> tools/import-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -f 
> example-filename.csv -fd failed-filename.csv -policy 2
# or
> tools/import-csv.bat -h 127.0.0.1 -p 6667 -u root -pw root -f 
> example-filename.csv -fd failed-filename.csv -policy 2
-table-aligned device
```

### Note

Note that the following special characters in fields need to be checked before 
importing:

1. `,` : fields containing `,` should be quoted by a pair of `"` or a pair of 
`'`.
2. `"` : `"` in fields should be replaced by `\"` or fields should be enclosed 
by `'`.
3. `'` : `'` in fields should be replaced by `\'` or fields should be enclosed 
by `"`.
4. you can input time format like `yyyy-MM-dd'T'HH:mm:ss`, `yyy-MM-dd 
HH:mm:ss`, or `yyyy-MM-dd'T'HH:mm:ss.SSSZ`.


> Better Import-CSV Tool
> ----------------------
>
>                 Key: IOTDB-842
>                 URL: https://issues.apache.org/jira/browse/IOTDB-842
>             Project: Apache IoTDB
>          Issue Type: Task
>          Components: Tools/Others
>            Reporter: Xiangdong Huang
>            Priority: Minor
>
> Hi, our import-csv tool is currently implemented by JDBC and requires a 
> fossil format:
> e.g., 
> {code:java}
> Time,root.sg.d1.s1,root.sg.d1.s2,root.sg.d2.s1,root.sg.d2.s2,root.sg.d2.s3
> 2020-08-18T10:22:31.603+08:00,1,2.0,null,null,null
> 2020-08-18T10:22:35.631+08:00,1,2.0,null,null,null
> 2020-08-18T10:22:41.093+08:00,null,null,1,2.0,null
> 2020-08-18T10:22:52.603+08:00,null,null,1,2.0,true
> {code}
> Requirement 1:
> As we support 3 kinds of output format: align all series (by default), align 
> by device, without alignment, it is better to support such 3 kinds of 
> import-csv format:
> a. 
> {code:java}
> Time,root.sg.d1.s1,root.sg.d1.s2,root.sg.d2.s1,root.sg.d2.s2,root.sg.d2.s3
> 2020-08-18T10:22:31.603+08:00,1,2.0,null,null,null
> 2020-08-18T10:22:35.631+08:00,1,2.0,null,null,null
> 2020-08-18T10:22:41.093+08:00,null,null,1,2.0,null
> 2020-08-18T10:22:52.603+08:00,null,null,1,2.0,true
> {code}
> b. 
> {code:java}
> Time,Device,s1,s2,s3
> 2020-08-18T10:22:31.603+08:00,root.sg.d1,1,2.0,null
> 2020-08-18T10:22:35.631+08:00,root.sg.d1,1,2.0,null
> 2020-08-18T10:22:41.093+08:00,root.sg.d2,1,2.0,null
> 2020-08-18T10:22:52.603+08:00,root.sg.d2,1,2.0,true
> {code}
> c.
> (it is strange, I'd like to do not support such format.)
> Requment2:
> Different users may have different time formats for the first column.
> So, we'd better support different kinds of time format. e.g., let users 
> define how to parse their timestamp: yyyy-MM-ddHH:mm:ss.SSS etc..
> Requirement 3:
> Support NULL as well as empty char to describe the null data point. For 
> example, the following  3 lines are the same:
> 2020-08-18T10:22:31.603+08:00,root.sg.d1,1,null,null
> 2020-08-18T10:22:31.603+08:00,root.sg.d1,1,,
> 2020-08-18T10:22:31.603+08:00,root.sg.d1,1,    ,
> Requirement 4:
> Support claiming the storage group name once rather than repeat the storage 
> group name for each line:
> e.g., for format b, we can tell the tool the sg is `root.sg` and then each 
> row looks like:
> 2020-08-18T10:22:35.631+08:00,d1,1,2.0,null
> Another option is add a new column called storage_group for each row.
> For UT:
> 1. all data type should be covered;
> 2. incorrect csv format should be covered;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (IOTDB-842) Better Import-CSV Tool

Reply via email to