Hi, I created external table, copied data files to that location and then count returns 4. It is ambiguous, can it be documented?
hive> CREATE EXTERNAL TABLE test_ext (col1 INT, col2 INT) > stored as orc > LOCATION '/apps/hive/warehouse/ext'; OK Time taken: 9.875 seconds hive> select count(*) from test_ext; Query ID = root_20160831094725_14753b28-68bb-4106-89b7-45052e0cf9a1 Total jobs = 1 Launching Job 1 out of 1 .... OK 4 Time taken: 30.366 seconds, Fetched: 1 row(s) hive> select * from test_ext; OK 1 2 3 4 1 2 3 4 Time taken: 6.478 seconds, Fetched: 4 row(s) On Wed, Aug 31, 2016 at 2:27 AM, Thejas Nair <thejas.n...@gmail.com> wrote: > Naveen, > Can you please verify if you create these tables as external tables the > results are correct ? > In case of managed tables, the assumption is that there is a 1:1 mapping > between tables and the locations and all update to the table are through > hive. With that assumption, it relies on stats to return results in queries > like count(*) . > > > On Tue, Aug 30, 2016 at 4:18 AM, Abhishek Somani < > abhisheksoman...@gmail.com > > wrote: > > > For the 2nd table(after both inserts are over), isn't the return count > > expected to be 4? In that case, isn't the the bug that the count was > > returned wrong(maybe from the stats as mentioned) rather the fact that > > another table was allowed to be created at the same location? > > > > I might be very wrong, so pardon my ignorance. > > > > On Tue, Aug 30, 2016 at 3:06 AM, Alan Gates <alanfga...@gmail.com> > wrote: > > > > > Note that Hive doesn’t track individual files, just which directory a > > > table stores its files in. So we wouldn’t expect this to work. The > bug > > is > > > more that Hive doesn’t detect that two tables are trying to use the > same > > > directory. I’m not sure we’re anxious to fix this since it would mean > > when > > > creating a table Hive would need to search all existing tables to make > > sure > > > none of them are using the directory the new table wants to use. > > > > > > Alan. > > > > > > > On Aug 30, 2016, at 04:17, Sergey Shelukhin <ser...@hortonworks.com> > > > wrote: > > > > > > > > This is a bug, or rather an unexpected usage. I suspect the correct > > count > > > > value is coming from statistics. > > > > Can you file a JIRA? > > > > > > > > On 16/8/29, 00:51, "naveen mahadevuni" <nmahadev...@gmail.com> > wrote: > > > > > > > >> Hi, > > > >> > > > >> Is the following behavior a bug? I believe at least one part of it > is > > a > > > >> bug. I created two Hive tables at the same location and inserted > rows > > in > > > >> two tables. count(*) returns the correct count for each individual > > > table, > > > >> but SELECT * on one tables reads the rows from other table files > too. > > > >> > > > >> CREATE TABLE test1 (col1 INT, col2 INT) > > > >> stored as orc > > > >> LOCATION '/apps/hive/warehouse/test1'; > > > >> > > > >> insert into test1 values(1,2); > > > >> insert into test1 values(3,4); > > > >> > > > >> hive> select count(*) from test1; > > > >> OK > > > >> 2 > > > >> Time taken: 0.177 seconds, Fetched: 1 row(s) > > > >> > > > >> > > > >> CREATE TABLE test2 (col1 INT, col2 INT) > > > >> stored as orc > > > >> LOCATION '/apps/hive/warehouse/test1'; > > > >> > > > >> insert into test2 values(1,2); > > > >> insert into test2 values(3,4); > > > >> > > > >> hive> select count(*) from test2; > > > >> OK > > > >> 2 > > > >> Time taken: 2.683 seconds, Fetched: 1 row(s) > > > >> > > > >> -- SELECT * fetches 4 records where as COUNT(*) above returns count > of > > > 2. > > > >> > > > >> hive> select * from test2; > > > >> OK > > > >> 1 2 > > > >> 3 4 > > > >> 1 2 > > > >> 3 4 > > > >> Time taken: 0.107 seconds, Fetched: 4 row(s) > > > >> hive> select * from test1; > > > >> OK > > > >> 1 2 > > > >> 3 4 > > > >> 1 2 > > > >> 3 4 > > > >> Time taken: 0.054 seconds, Fetched: 4 row(s) > > > >> > > > >> Thanks, > > > >> Naveen > > > > > > > > > > > > >