Hi,

I created external table, copied data files to that location and then count
returns 4. It is ambiguous, can it be documented?

hive> CREATE EXTERNAL TABLE test_ext (col1 INT, col2 INT)
    > stored as orc
    > LOCATION '/apps/hive/warehouse/ext';
OK
Time taken: 9.875 seconds

hive> select count(*) from test_ext;
Query ID = root_20160831094725_14753b28-68bb-4106-89b7-45052e0cf9a1
Total jobs = 1
Launching Job 1 out of 1
....
OK
4
Time taken: 30.366 seconds, Fetched: 1 row(s)

hive> select * from test_ext;
OK
1       2
3       4
1       2
3       4
Time taken: 6.478 seconds, Fetched: 4 row(s)


On Wed, Aug 31, 2016 at 2:27 AM, Thejas Nair <thejas.n...@gmail.com> wrote:

> Naveen,
> Can you please verify if you create these tables as external tables the
> results are correct ?
> In case of managed tables, the assumption is that there is a 1:1 mapping
> between tables and the locations and all update to the table are through
> hive. With that assumption, it relies on stats to return results in queries
> like count(*) .
>
>
> On Tue, Aug 30, 2016 at 4:18 AM, Abhishek Somani <
> abhisheksoman...@gmail.com
> > wrote:
>
> > For the 2nd table(after both inserts are over), isn't the return count
> > expected to be 4? In that case, isn't the the bug that the count was
> > returned wrong(maybe from the stats as mentioned) rather the fact that
> > another table was allowed to be created at the same location?
> >
> > I might be very wrong, so pardon my ignorance.
> >
> > On Tue, Aug 30, 2016 at 3:06 AM, Alan Gates <alanfga...@gmail.com>
> wrote:
> >
> > > Note that Hive doesn’t track individual files, just which directory a
> > > table stores its files in.  So we wouldn’t expect this to work.  The
> bug
> > is
> > > more that Hive doesn’t detect that two tables are trying to use the
> same
> > > directory.  I’m not sure we’re anxious to fix this since it would mean
> > when
> > > creating a table Hive would need to search all existing tables to make
> > sure
> > > none of them are using the directory the new table wants to use.
> > >
> > > Alan.
> > >
> > > > On Aug 30, 2016, at 04:17, Sergey Shelukhin <ser...@hortonworks.com>
> > > wrote:
> > > >
> > > > This is a bug, or rather an unexpected usage. I suspect the correct
> > count
> > > > value is coming from statistics.
> > > > Can you file a JIRA?
> > > >
> > > > On 16/8/29, 00:51, "naveen mahadevuni" <nmahadev...@gmail.com>
> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Is the following behavior a bug? I believe at least one part of it
> is
> > a
> > > >> bug. I created two Hive tables at the same location and inserted
> rows
> > in
> > > >> two tables. count(*) returns the correct count for each individual
> > > table,
> > > >> but SELECT * on one tables reads the rows from other table files
> too.
> > > >>
> > > >> CREATE TABLE test1 (col1 INT, col2 INT)
> > > >> stored as orc
> > > >> LOCATION '/apps/hive/warehouse/test1';
> > > >>
> > > >> insert into test1 values(1,2);
> > > >> insert into test1 values(3,4);
> > > >>
> > > >> hive> select count(*) from test1;
> > > >> OK
> > > >> 2
> > > >> Time taken: 0.177 seconds, Fetched: 1 row(s)
> > > >>
> > > >>
> > > >> CREATE TABLE test2 (col1 INT, col2 INT)
> > > >> stored as orc
> > > >> LOCATION '/apps/hive/warehouse/test1';
> > > >>
> > > >> insert into test2 values(1,2);
> > > >> insert into test2 values(3,4);
> > > >>
> > > >> hive> select count(*) from test2;
> > > >> OK
> > > >> 2
> > > >> Time taken: 2.683 seconds, Fetched: 1 row(s)
> > > >>
> > > >> -- SELECT * fetches 4 records where as COUNT(*) above returns count
> of
> > > 2.
> > > >>
> > > >> hive> select * from test2;
> > > >> OK
> > > >> 1       2
> > > >> 3       4
> > > >> 1       2
> > > >> 3       4
> > > >> Time taken: 0.107 seconds, Fetched: 4 row(s)
> > > >> hive> select * from test1;
> > > >> OK
> > > >> 1       2
> > > >> 3       4
> > > >> 1       2
> > > >> 3       4
> > > >> Time taken: 0.054 seconds, Fetched: 4 row(s)
> > > >>
> > > >> Thanks,
> > > >> Naveen
> > > >
> > >
> > >
> >
>

Reply via email to