[ 
https://issues.apache.org/jira/browse/IMPALA-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302967#comment-17302967
 ] 

Wenzhe Zhou edited comment on IMPALA-10564 at 3/18/21, 5:13 PM:
----------------------------------------------------------------

Since we write column by column when writing a row, we have to rewind the table writer for a partially written row if we want to skip a row with invalid column data.

I read the code of the table writers for the different formats to confirm whether we can rewind the writer for a partially written row. It seems this is not hard for the Kudu, Text, and HBase formats, since they buffer row data before writing a row to the table. But it's really hard to find a good way for Parquet to skip a partially written row.

Kudu (KuduTableSink::Send() in kudu-table-sink.cc) creates a KuduWriteOperation object for each row and pushes the object into a vector after adding all columns. We could change the code to not push the KuduWriteOperation object into the vector if any column has invalid data.
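A minimal sketch of that change, assuming hypothetical helpers (CreateWriteOpForRow() and SetColumnValue() stand in for the real column-population logic in KuduTableSink::Send(); this is not the actual Impala code):

{code:cpp}
// Sketch only: per-row loop that drops a partially populated write op.
std::vector<std::unique_ptr<kudu::client::KuduWriteOperation>> write_ops;
for (int row_idx = 0; row_idx < num_rows; ++row_idx) {
  // Hypothetical helper that creates the write op for this row.
  std::unique_ptr<kudu::client::KuduWriteOperation> op = CreateWriteOpForRow(row_idx);
  bool row_ok = true;
  for (int col = 0; col < num_cols; ++col) {
    // Hypothetical helper; returns false when the evaluated value is
    // invalid (e.g. decimal overflow).
    if (!SetColumnValue(op.get(), col, row_idx)) {
      row_ok = false;
      break;  // stop populating this row
    }
  }
  // Only buffer the op when every column was set successfully; otherwise the
  // partially populated op is simply dropped, which skips the row.
  if (row_ok) write_ops.push_back(std::move(op));
}
{code}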

The text table writer (HdfsTextTableWriter::AppendRows() in hdfs-text-table-writer.cc) uses a stringstream to buffer the row data. The stringstream itself cannot be rewound, but we could save the ending offset of the last complete row and, if we hit an invalid column value while writing a new row, flush the stringstream only up to that saved offset, then reset the stringstream for the next row.
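A minimal sketch of the offset-tracking idea, assuming a hypothetical IsValid() check and ignoring the real HdfsTextTableWriter details (escaping, partitions, the codegen'd output expressions):

{code:cpp}
#include <sstream>
#include <string>
#include <vector>

std::ostringstream rowbatch_ss;   // buffers the text for the current batch
size_t last_row_end = 0;          // offset just past the last complete row

bool WriteRow(const std::vector<std::string>& cols) {
  for (size_t i = 0; i < cols.size(); ++i) {
    if (!IsValid(cols[i])) {      // hypothetical validity check
      // Rewind: keep only the bytes up to the end of the last good row.
      std::string kept = rowbatch_ss.str();
      kept.resize(last_row_end);
      rowbatch_ss.str(kept);
      rowbatch_ss.seekp(0, std::ios_base::end);
      return false;               // row skipped
    }
    if (i > 0) rowbatch_ss << '\001';  // field delimiter
    rowbatch_ss << cols[i];
  }
  rowbatch_ss << '\n';            // row delimiter
  last_row_end = static_cast<size_t>(rowbatch_ss.tellp());
  return true;
}
{code}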

Although HBase is known as a column-oriented database (where column data is stored together), the data for a particular row actually stays together in HBase, and the column data is spread out rather than stored together. The HBase table writer (HBaseTableWriter::AppendRows() in hbase-table-writer.cc) creates one "Put" object for each row, saves a batch of "Put" objects in a Java ArrayList, and then writes the batch of rows in one function call. We can change the code so that the "Put" is not added to the ArrayList right after it is created; instead, it is added only after all columns have been successfully added to the "Put".

It seems the decimal data type is not supported for HBase; I did not see any table in the functional_hbase database with a decimal column.

Parquet, however, is a truly column-oriented file format. The Parquet table writer (HdfsParquetTableWriter in hdfs-parquet-table-writer.cc) creates a BaseColumnWriter object for each column, and each column writer has its own data page to buffer column data across rows. When the current data page of a BaseColumnWriter is full, it is flushed (finalized by calling FinalizeCurrentPage()). It is really complicated (maybe not feasible) to rewind a column writer after its data page has been flushed. Even when the column values of a partially written row are still in each column writer's data page, some values/operations are hard to revert. For example, the page min/max stats are hard to revert. With dictionary encoding, a column value is added to the dictionary only if it is not already there; this operation is hard to revert since we don't know whether the value in the dictionary was added for the current row or for some previous row. So it's hard to rewind the table writer to skip a partially written row.
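A small illustration (not Impala code) of why the buffered page state is effectively irreversible: once a value has been folded into the running min/max, the previous stats are lost, so undoing a skipped row's contribution would require re-scanning the whole page. The dictionary has the analogous problem of not recording which row first inserted a given value.

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <limits>

struct PageStats {
  int64_t min_val = std::numeric_limits<int64_t>::max();
  int64_t max_val = std::numeric_limits<int64_t>::min();
  void Update(int64_t v) {
    min_val = std::min(min_val, v);  // previous min is overwritten, cannot be restored
    max_val = std::max(max_val, v);  // previous max is overwritten, cannot be restored
  }
};
{code}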

Based on the above investigation, we will not support row skipping. We will add a new query option "use_null_for_decimal_errors" as mentioned in Aman's comments.

 



> No error returned when inserting an overflowed value into a decimal column
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-10564
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10564
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Frontend
>    Affects Versions: Impala 4.0
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> When using CTAS statements or INSERT-SELECT statements to insert rows into a 
> table with decimal columns, Impala inserts NULL for overflowed decimal values 
> instead of returning an error. This issue happens when the data expression for 
> the decimal column in the SELECT sub-query contains at least one alias. The 
> issue is similar to IMPALA-6340, but IMPALA-6340 only fixed the cases where 
> the data expression for the decimal column is a constant, so the overflowed 
> decimal value could be detected by the frontend during expression analysis. 
> If there is an alias (a variable) in the data expression for the decimal 
> column, the frontend cannot evaluate the data expression during the 
> expression analysis phase; only the backend can evaluate it when it executes 
> the fragment instances for the SELECT sub-query. The log messages showed that 
> the executor detected the decimal overflow error, but somehow it did not 
> propagate the error to the coordinator, hence the error was not returned to 
> the client.


