morningman opened a new issue #3549:
URL: https://github.com/apache/incubator-doris/issues/3549


   ## Motivation
   
   Doris' current data export method is relatively simple: data can only be exported through the `Export` feature, which only exports data at the partition level and does not support advanced SQL features such as expressions, joins, and GROUP BY.
   
   In some cases, Doris users need to export query results:
   
   1. Export query results for other users to download.
   2. Export query results for further processing by other systems.
   3. Bitmap type export: sometimes users store a large number of ids (100 million+) in a bitmap and need to return these ids. At present, Doris does not support UDTF, and if the ids are returned through the MySQL protocol, the result set is too large to process.
   
   ## Alternatives
   
   ### Support `SELECT ... FROM tbl INTO OUTFILE xxx ...`
   
   This is the most intuitive and user-friendly way. The grammar is borrowed from the [MySQL grammar manual](https://dev.mysql.com/doc/refman/8.0/en/select-into.html). In MySQL's implementation, the exported file is written to the local file system of the MySQL server. Since Doris is a distributed system, query results may be produced by any node, so there are two options for exporting:
   
   1. Write the result to the local file system of a BE, and provide users with an HTTP link for downloading.
   
       This solution is relatively simple, but it involves some problems:
   
       1. File cleanup. The result file may be very large and take up too much BE disk space. It cannot be kept indefinitely, so a mechanism is needed to clean it up regularly.
       2. Connectivity. The user may not have permission to, or may be unable to, directly access a BE node, so the file cannot be downloaded.
   
       The first problem can be addressed by limiting the size of a single file, monitoring disk space, and adding a file expiration setting with regular cleanup. The second problem cannot be solved, but this option can still serve as a temporary solution for some users.
   
   2. Write the result to a remote storage system, such as HDFS.
   
       This solution can use the Broker to write files to the remote system, which solves both problems of Option 1.
   
   I plan to implement **Option 1** in the first phase so that users who do not yet have remote storage can use this feature. After that, only minor changes are needed: by adding a different file-writing interface, we can support Option 2.
   
   ### Other systems
   
   For `AWS Redshift`, it provides an [UNLOAD](https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD_command_examples.html) command to satisfy this requirement, which acts the same as `SELECT INTO OUTFILE` in MySQL.
   
   `Vertica` and `Snowflake` provide shell tools to export query results: [Vertica](https://dwgeek.com/how-to-export-vertica-query-results-into-csv.html/), [Snowflake](https://dwgeek.com/export-snowflake-table-data-to-local-csv-format.html/). Snowflake also supports a [COPY INTO](https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html) command.
   
   ### Export format
   
   **CSV** is a common format for exporting query results. But some columns have binary types, such as BITMAP_UNION and HLL, so we need a way to save these kinds of values.
   **Parquet**, I think, is a good format for this. Binary values can be stored using Parquet's `BYTE_ARRAY` type.
   But Parquet does not support `LargeInt`; I will think about that later. For most cases, Parquet works.
   
   ## Implementation
   
   ### Syntax
   
   ```
   query_stmt INTO OUTFILE "file:///path/to/export" FORMAT AS CSV|PARQUET opt_properties
   ```
   
   Example:
   
   ```
   SELECT a, sum(b) FROM tbl1 GROUP BY a LIMIT 10 INTO OUTFILE "file:///path/to/export" FORMAT AS CSV
   PROPERTIES("prop1" = "value1");
   
   SELECT a FROM tbl1 UNION SELECT b FROM tbl2 INTO OUTFILE "file:///path/to/export" FORMAT AS CSV;
   ```
   
   ### OutFileSink class
   
   `OutFileSink` is a subclass of `DataSink`. It replaces the `ResultSink` in the query plan when the user specifies the `OUTFILE` clause.
   
   The `OutFileSink` receives the query result and writes it to the user-specified location.
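
   A rough sketch of such a sink is shown below. This is a hypothetical, simplified interface (plain string rows instead of Doris' actual row batch and status types), not the real `DataSink` API:
   
   ```cpp
   #include <cassert>
   #include <fstream>
   #include <sstream>
   #include <string>
   #include <vector>
   
   // Hypothetical, simplified OutFileSink: receives query result rows and
   // writes them as CSV to a file on the BE's local file system.
   class OutFileSink {
   public:
       explicit OutFileSink(std::string path, char column_separator = ',')
           : _path(std::move(path)), _sep(column_separator) {}
   
       // Open the target file on the local file system.
       bool open() {
           _out.open(_path, std::ios::out | std::ios::trunc);
           return _out.is_open();
       }
   
       // Receive one row of the query result and append it as a CSV line.
       void send_row(const std::vector<std::string>& row) {
           for (size_t i = 0; i < row.size(); ++i) {
               if (i > 0) _out << _sep;
               _out << row[i];
           }
           _out << '\n';
           ++_rows_written;
       }
   
       // Flush and close; returns the number of exported rows, which the
       // statement can report back to the client.
       size_t close() {
           _out.close();
           return _rows_written;
       }
   
   private:
       std::string _path;
       char _sep;
       std::ofstream _out;
       size_t _rows_written = 0;
   };
   ```
   
   The real implementation would also need to handle escaping of the separator inside field values and the size limit discussed in the Options section.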
   
   ### GC
   
   We can reuse the data file cleanup thread to help us clean up expired result 
files.
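
   One cleanup pass could look roughly like the sketch below (hypothetical code, not the actual Doris cleanup thread), assuming the result file name embeds its expiration in seconds as `name.expiration.csv|parquet`:
   
   ```cpp
   #include <cassert>
   #include <chrono>
   #include <filesystem>
   #include <string>
   
   namespace fs = std::filesystem;
   
   // Extract the expiration (in seconds) embedded in a result file name of
   // the form "name.expiration.csv|parquet"; returns -1 if it does not match.
   long parse_expiration_seconds(const std::string& file_name) {
       auto last_dot = file_name.rfind('.');
       if (last_dot == std::string::npos || last_dot == 0) return -1;
       auto prev_dot = file_name.rfind('.', last_dot - 1);
       if (prev_dot == std::string::npos) return -1;
       std::string field = file_name.substr(prev_dot + 1, last_dot - prev_dot - 1);
       try {
           return std::stol(field);
       } catch (...) {
           return -1;
       }
   }
   
   // One pass of the cleanup thread: remove result files whose age exceeds
   // the expiration encoded in their name. Returns the number removed.
   int clean_expired_result_files(const fs::path& dir) {
       int removed = 0;
       auto now = fs::file_time_type::clock::now();
       for (const auto& entry : fs::directory_iterator(dir)) {
           if (!entry.is_regular_file()) continue;
           long expiration = parse_expiration_seconds(entry.path().filename().string());
           if (expiration < 0) continue;  // not a result file, skip
           auto age = std::chrono::duration_cast<std::chrono::seconds>(
               now - entry.last_write_time()).count();
           if (age > expiration) {
               fs::remove(entry.path());
               ++removed;
           }
       }
       return removed;
   }
   ```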
   
   ### Statement return value
   
   In MySQL, `SELECT INTO OUTFILE` returns an empty result set along with the number of rows exported.
   
   ```
   mysql> select * from tbl1 into outfile "/tmp/out.txt";
   Query OK, 100 rows affected (0.06 sec)
   ```
   
   We will keep consistent with this behavior.
   
   ### Options
   
   Some options can be specified in `properties` of the statement:
   
   1. `result_size_in_bytes`
   
       Limits the size of the result file; the default is 1GB. This only limits the file size on the local file system.
   
   2. `result_file_expiration_time`
   
       The expiration time of the result file on the local file system, in seconds. After expiration, the file will be removed.
   
   3. `file name`
   
       The result file name, in the format `name.expiration.csv|parquet`. If no file name is specified, the `QueryId` plus a timestamp is used instead.
   
       For example:
   
       ```
       my_result_file.86400.csv
   
       10abb9b9a7c4e67-8a6a11ee73dd6278_1589177135.3600.parquet
       ```
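
   The naming rule above could be implemented by a small helper like this (a hypothetical function for illustration, not actual Doris code):
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <string>
   
   // Build the result file name "name.expiration.csv|parquet". When the
   // user gives no name, fall back to "<query_id>_<timestamp>".
   std::string make_result_file_name(const std::string& user_name,
                                     const std::string& query_id,
                                     int64_t unix_timestamp,
                                     int64_t expiration_seconds,
                                     const std::string& format) {  // "csv" or "parquet"
       std::string base = user_name.empty()
           ? query_id + "_" + std::to_string(unix_timestamp)
           : user_name;
       return base + "." + std::to_string(expiration_seconds) + "." + format;
   }
   ```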
   
   ### Remaining problems
   
   1. Whether to support asynchronous export operations
   
      Some queries may take a long time or produce a large result set, which may cause the request to occupy the connection for a long time.
      We can consider supporting asynchronous commands, and use a `Session Variable` to control whether asynchronous mode is used.
   
   ### Scheduling Plan
   
   1. Implement the `OutFileSink` on BE and complete the local file export function (in CSV format).
   2. Implement the FE side syntax and query plan, and add support for 
`OutFileSink`.
   3. Support `OutFileSink` to export files to remote storage system through 
Broker.
   4. Support Parquet format export.
   

