[
https://issues.apache.org/jira/browse/TAJO-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jihoon Son resolved TAJO-1315.
------------------------------
Resolution: Not a Problem
The invalid results were caused by broken input data, not by a bug in Tajo.
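A minimal sketch of how such broken records could be detected before a table is loaded. It assumes every well-formed line carries exactly eight `|`-delimited fields (the column count of the table in the quoted report), optionally followed by one trailing delimiter. The function name `find_malformed_lines` and the sample rows are made up for illustration and are not part of Tajo.

```python
from typing import Iterable, List, Tuple

EXPECTED_FIELDS = 8  # assumption: column count of the table in this issue


def find_malformed_lines(lines: Iterable[str],
                         delimiter: str = "|",
                         expected: int = EXPECTED_FIELDS) -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for each line with a wrong field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\n")
        # Tolerate a single trailing delimiter (assumed generator behaviour).
        # Caveat: a record whose last field is empty is then counted one short.
        if record.endswith(delimiter):
            record = record[:-len(delimiter)]
        count = len(record.split(delimiter))
        if count != expected:
            bad.append((number, count))
    return bad


# Made-up rows: one complete record and two fragments of a cut record.
sample = [
    "1|Name#1|addr one|15|25-000-000-0000|711.56|BUILDING|some comment|\n",
    "2|Name#0\n",
    "2|addr two|13|23-000-000-0000|121.65|AUTOMOBILE|other comment|\n",
]
print(find_malformed_lines(sample))
```

Running the same check over each part file, one file at a time, would show which files hold the broken lines.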
> Invalid results are returned when a source table consists of multiple CSV
> files
> -------------------------------------------------------------------------------
>
> Key: TAJO-1315
> URL: https://issues.apache.org/jira/browse/TAJO-1315
> Project: Tajo
> Issue Type: Bug
> Components: storage
> Reporter: Jihoon Son
> Priority: Critical
> Fix For: 0.10
>
>
> See the title.
> The following example reproduces the problem.
> {noformat}
> default> \dfs -ls /customer.tbl
> Found 19 items
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000001
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000002
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000003
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000004
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000005
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000006
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000007
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000008
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000009
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000010
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000011
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000012
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000013
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000014
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000015
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:25 /customer.tbl/000016
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:26 /customer.tbl/000017
> -rw-r--r-- 3 hadoop supergroup 134217728 2015-01-26 20:26 /customer.tbl/000018
> -rw-r--r-- 3 hadoop supergroup 47571167 2015-01-26 20:26 /customer.tbl/000019
> default> create external table test (C_CUSTKEY bigint, C_NAME text, C_ADDRESS
> text, C_NATIONKEY bigint, C_PHONE text, C_ACCTBAL double, C_MKTSEGMENT text,
> C_COMMENT text) using csv with ('csvfile.delimiter'='|') location
> 'hdfs://192.168.0.1:7020/customer.tbl';
> OK
> default> \d test
> table name: tpch_swift.test
> table path: hdfs://192.168.0.1:7020/customer.tbl
> store type: CSV
> number of rows: unknown
> volume: 2.5 GB
> Options:
> 'text.delimiter'='|'
> schema:
> c_custkey INT8
> c_name TEXT
> c_address TEXT
> c_nationkey INT8
> c_phone TEXT
> c_acctbal FLOAT8
> c_mktsegment TEXT
> c_comment TEXT
> default> select count(*) from test;
> ?count
> -------------------------------
> 15000017
> (1 rows, 3.2 sec, 9 B selected)
> {noformat}
> The expected count is 15000000, but the query returned 15000017, which is 17
> rows too many.
> To find the erroneous tuples, I ran the following queries.
> {noformat}
> default> select c_custkey, count(*) as cnt from customer2 group by c_custkey
> having cnt > 1;
> c_custkey, cnt
> -------------------------------
> , 14
> 114575, 2
> 14711665, 2
> 34, 2
> (4 rows, 16.681 sec, 29 B selected)
> default> select * from customer2 where c_custkey is null or c_custkey =
> 114575 or c_custkey = 14711665 or c_custkey = 34;
> c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal,
> c_mktsegment, c_comment
> -------------------------------
> 34, Customer#000000034, Q6G9wZ6dnczmtOx509xgE,M2KV, 15, 25-344-968-5422,
> 8589.7, HOUSEHOLD, nder against the even, pending accounts. even
> 114575, Customer#000114575, xqLzTzY0,QvqwlSPI8OLxjRQ4s2W7pkSWwK, 16,
> 26-303-921-2836, 6663.68, AUTOMOBILE, le fluffily final deposits.
> furiously regu
> , 21, 31-264-911-5053, , HOUSEHOLD, 0.0, ,
> , IexCQQNp7tsMK63QKrGw37H3JJXGPaXBk, 18, , 4313.01, 0.0, the never
> pending accounts. slyly fluffy pinto beans run fluffily. furiously ,
> , , , , , , ,
> , 152.95, MACHINERY, , , , ,
> , t the ironic, close accounts are careful, , , , , ,
> , 20, 30-481-475-8163, , AUTOMOBILE, 0.0, ,
> , , , , , , ,
> , MACHINERY, ts use slyly even dependencie, , , , ,
> , , , , , , ,
> , 24, 34-639-456-9692, , FURNITURE, 0.0, ,
> , , , , , , ,
> 114575, , , , , , ,
> 34, Customer#011457534, wFUkCU67OxuxvfQeSdvSMDtMB7DWt7jiw, 2,
> 12-145-168-8442, 145.78, MACHINERY, ic accounts. ironic, final ideas sleep
> qu
> , XPP8pRDTDs4MFMP7SSlv, 17, , 5437.09, 0.0, egular requests cajole
> slyly after the ,
> , blithely along the regular, daring deposits. ironic acco, , , , , ,
> , 12, 22-656-233-3821, , HOUSEHOLD, 0.0, ,
> 14711665, Customer#0, , , , , ,
> 14711665, QKTarsTkX7, 19, , 7017.62, 0.0, ly after the carefully ironic
> theodolites. pending requests are slyly across the deposits. even accounts
> boost. fina,
> (20 rows, 8.964 sec, 1.2 KiB selected)
> {noformat}
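The last two rows above look like two halves of one record with key 14711665, and every part file except the last is exactly 134217728 bytes. That would fit files that were cut at a fixed byte size instead of at line ends, but this is an assumption, not something the report states. Under that assumption, a rough sketch for flagging part files whose final line is not newline-terminated, run against local copies of the files (the helper names and the demo files are hypothetical):

```python
import os
import tempfile
from typing import Iterable, List


def ends_with_newline(path: str) -> bool:
    """Return True if the file is empty or its last byte is a newline."""
    if os.path.getsize(path) == 0:
        return True
    with open(path, "rb") as handle:
        handle.seek(-1, os.SEEK_END)  # jump to the final byte
        return handle.read(1) == b"\n"


def files_cut_mid_record(paths: Iterable[str]) -> List[str]:
    """List the files whose final line is not newline-terminated."""
    return [path for path in paths if not ends_with_newline(path)]


# Demo on two made-up part files: the second one stops mid-record.
with tempfile.TemporaryDirectory() as tmp:
    whole = os.path.join(tmp, "000001")
    cut = os.path.join(tmp, "000002")
    with open(whole, "wb") as out:
        out.write(b"1|a|\n2|b|\n")
    with open(cut, "wb") as out:
        out.write(b"3|c|\n4|d")
    flagged = [os.path.basename(p) for p in files_cut_mid_record([whole, cut])]

print(flagged)
```

A file flagged this way ends in a partial record, and the file that follows it would then start with the remainder of that record.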