Work on HIVE-6009

2016-11-08 Thread naveen mahadevuni
Hi,

I'm starting as a Hive contributor. I would like to work on HIVE-6009
. I found some related
JIRAs HIVE-345 , HIVE-2867
 and HIVE-6040
. Is anyone else working
on similar changes.

Thanks,
Naveen


Re: Load performance with partitioned table

2016-09-19 Thread naveen mahadevuni
hi Franke,

1) We are using 4 indentical AWS machines. 8 vCPUs, 32 GB RAM. 1 TB storage
2) Setting up bloom filters only on two other string columns. Not all of
them.
3) The data is any event data ex: Syslog.
4) Queries usually run on timestamp range with additional predicates on
other columns (mostly equality)
4) We use SNAPPY compression with 256 MB blocks.
5) ORC stripe size is 256MB, HDFS block size is 128 MB
6) The time for first INSERT is 206 seconds and the second one is 302
seconds.

Thanks,
Naveen

On Fri, Sep 16, 2016 at 4:57 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> What is your hardware setup?
> Are the bloom filters necessary on all columns? Usually they make only
> sense for non-numeric columns. Updating bloom filters take time and should
> be avoided where they do not make sense.
> Can you provide an example of the data and the select queries that you
> execute on them?
> Do you use compression on the tables? If so which?
> What are the exact times and data volumes?
>
> > On 15 Sep 2016, at 19:56, naveen mahadevuni <nmahadev...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > I'm using ORC format for our table storage. The table has a timestamp
> > column(say TS) and 25 other columns. The other ORC properties we are
> using
> > arestorage index and bloom filters. We are loading 100 million records in
> > to this table on a 4-node cluster.
> >
> > Our source table is a text table with CSV format. In the source table
> > timestamp values come as BIGINT. In the INSERT SELECT, we use function
> > "from_unixtime(sourceTable.TS)" to convert the BIGINT values to
> timestamp
> > in the target ORC table. So the first INSERT SELECT in to non-partitioned
> > table looks like this
> >
> > 1) INSERT INTO TARGET SELECT from_unixtime(ts), col1, col2... from
> SOURCE.
> >
> > I wanted to test by partitioning the table by date derived from this
> > timestamp, so I used "to_date(from_unixtime(TS))" in the new INSERT
> SELECT
> > with dynamic partitioning. The second one is
> >
> > 2) INSERT INTO TARGET PARTITION(datecol) SELECT from_unixtime(ts), col1,
> > col2... to_date(from_unixtime(ts)) as datecol from SOURCE.
> >
> > The load time increased by 50% from 1 to 2. I understand the second
> > statement involves creating many more partition directories and files.
> >
> > Is there anyway we can improve the load time? In the second INSERT
> SELECT,
> > will the result of the expression "from_unixtime(ts)" be reused in
> > "to_date(from_unixtime(ts))"?
> >
> > Thanks,
> > Naveen
>


Load performance with partitioned table

2016-09-15 Thread naveen mahadevuni
Hi,

I'm using ORC format for our table storage. The table has a timestamp
column(say TS) and 25 other columns. The other ORC properties we are using
arestorage index and bloom filters. We are loading 100 million records in
to this table on a 4-node cluster.

Our source table is a text table with CSV format. In the source table
timestamp values come as BIGINT. In the INSERT SELECT, we use function
"from_unixtime(sourceTable.TS)" to convert the BIGINT values to timestamp
in the target ORC table. So the first INSERT SELECT in to non-partitioned
table looks like this

1) INSERT INTO TARGET SELECT from_unixtime(ts), col1, col2... from SOURCE.

I wanted to test by partitioning the table by date derived from this
timestamp, so I used "to_date(from_unixtime(TS))" in the new INSERT SELECT
with dynamic partitioning. The second one is

2) INSERT INTO TARGET PARTITION(datecol) SELECT from_unixtime(ts), col1,
col2... to_date(from_unixtime(ts)) as datecol from SOURCE.

The load time increased by 50% from 1 to 2. I understand the second
statement involves creating many more partition directories and files.

Is there anyway we can improve the load time? In the second INSERT SELECT,
will the result of the expression "from_unixtime(ts)" be reused in
"to_date(from_unixtime(ts))"?

Thanks,
Naveen


Re: Hive ACID table error

2016-09-15 Thread naveen mahadevuni
Looks like this has been addressed in HIVE-11716
<https://issues.apache.org/jira/browse/HIVE-11716>.

Thanks,
Naveen

On Thu, Sep 15, 2016 at 5:41 AM, Eugene Koifman <ekoif...@hortonworks.com>
wrote:

> There should be a full stack trace somewhere either in the client side log
> on the job logs.
> ³serious error² is usually ORC complaining about some sort of data
> corruption.
>
> On 9/14/16, 11:16 AM, "naveen mahadevuni" <nmahadev...@gmail.com> wrote:
>
> >Hi Wei,
> >I'm using the hive shell.
> >
> >Thanks,
> >Naveen
> >
> >On Wed, Sep 14, 2016 at 8:01 PM, Wei Zheng <wzh...@hortonworks.com>
> wrote:
> >
> >> Hi Naveen,
> >>
> >> Which client are you using? Beeline?
> >>
> >> Thanks,
> >> Wei
> >>
> >> On 9/14/16, 18:25, "naveen mahadevuni" <nmahadev...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I'm using Hive 1,.2. From a non-ACID hive session, I performed the
> >> following operations and Hive reports 'serious problem'.
> >>
> >> CREATE TABLE test5(
> >>   i int,
> >>   j int)
> >> CLUSTERED BY (i) INTO 8 BUCKETS
> >> STORED AS ORC
> >> TBLPROPERTIES ('transactional'='true');
> >>
> >> insert into test5 values(1,2);
> >> insert into test5 values(3,4);
> >>
> >> select * from test5; -- Fails reporting serious problem.
> >>
> >> *-->Failed with exception java.io.IOException:java.lang.
> >> RuntimeException:
> >> serious problem*
> >>
> >> Hive documents "Reading/writing to an ACID table from a non-ACID
> >> session is
> >> not allowed.". Can a better message be reported rather than 'serious
> >> problem'?
> >>
> >> Thanks,
> >> Naveen
> >>
> >>
> >>
>
>


Re: Hive ACID table error

2016-09-14 Thread naveen mahadevuni
Hi Wei,
I'm using the hive shell.

Thanks,
Naveen

On Wed, Sep 14, 2016 at 8:01 PM, Wei Zheng <wzh...@hortonworks.com> wrote:

> Hi Naveen,
>
> Which client are you using? Beeline?
>
> Thanks,
> Wei
>
> On 9/14/16, 18:25, "naveen mahadevuni" <nmahadev...@gmail.com> wrote:
>
> Hi,
>
> I'm using Hive 1,.2. From a non-ACID hive session, I performed the
> following operations and Hive reports 'serious problem'.
>
> CREATE TABLE test5(
>   i int,
>   j int)
> CLUSTERED BY (i) INTO 8 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ('transactional'='true');
>
> insert into test5 values(1,2);
> insert into test5 values(3,4);
>
> select * from test5; -- Fails reporting serious problem.
>
> *-->Failed with exception java.io.IOException:java.lang.
> RuntimeException:
> serious problem*
>
> Hive documents "Reading/writing to an ACID table from a non-ACID
> session is
> not allowed.". Can a better message be reported rather than 'serious
> problem'?
>
> Thanks,
> Naveen
>
>
>


Hive ACID table error

2016-09-14 Thread naveen mahadevuni
Hi,

I'm using Hive 1,.2. From a non-ACID hive session, I performed the
following operations and Hive reports 'serious problem'.

CREATE TABLE test5(
  i int,
  j int)
CLUSTERED BY (i) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

insert into test5 values(1,2);
insert into test5 values(3,4);

select * from test5; -- Fails reporting serious problem.

*-->Failed with exception java.io.IOException:java.lang.RuntimeException:
serious problem*

Hive documents "Reading/writing to an ACID table from a non-ACID session is
not allowed.". Can a better message be reported rather than 'serious
problem'?

Thanks,
Naveen


Re: More than one table created at the same location

2016-08-31 Thread naveen mahadevuni
Hi,

I created external table, copied data files to that location and then count
returns 4. It is ambiguous, can it be documented?

hive> CREATE EXTERNAL TABLE test_ext (col1 INT, col2 INT)
> stored as orc
> LOCATION '/apps/hive/warehouse/ext';
OK
Time taken: 9.875 seconds

hive> select count(*) from test_ext;
Query ID = root_20160831094725_14753b28-68bb-4106-89b7-45052e0cf9a1
Total jobs = 1
Launching Job 1 out of 1

OK
4
Time taken: 30.366 seconds, Fetched: 1 row(s)

hive> select * from test_ext;
OK
1   2
3   4
1   2
3   4
Time taken: 6.478 seconds, Fetched: 4 row(s)


On Wed, Aug 31, 2016 at 2:27 AM, Thejas Nair <thejas.n...@gmail.com> wrote:

> Naveen,
> Can you please verify if you create these tables as external tables the
> results are correct ?
> In case of managed tables, the assumption is that there is a 1:1 mapping
> between tables and the locations and all update to the table are through
> hive. With that assumption, it relies on stats to return results in queries
> like count(*) .
>
>
> On Tue, Aug 30, 2016 at 4:18 AM, Abhishek Somani <
> abhisheksoman...@gmail.com
> > wrote:
>
> > For the 2nd table(after both inserts are over), isn't the return count
> > expected to be 4? In that case, isn't the the bug that the count was
> > returned wrong(maybe from the stats as mentioned) rather the fact that
> > another table was allowed to be created at the same location?
> >
> > I might be very wrong, so pardon my ignorance.
> >
> > On Tue, Aug 30, 2016 at 3:06 AM, Alan Gates <alanfga...@gmail.com>
> wrote:
> >
> > > Note that Hive doesn’t track individual files, just which directory a
> > > table stores its files in.  So we wouldn’t expect this to work.  The
> bug
> > is
> > > more that Hive doesn’t detect that two tables are trying to use the
> same
> > > directory.  I’m not sure we’re anxious to fix this since it would mean
> > when
> > > creating a table Hive would need to search all existing tables to make
> > sure
> > > none of them are using the directory the new table wants to use.
> > >
> > > Alan.
> > >
> > > > On Aug 30, 2016, at 04:17, Sergey Shelukhin <ser...@hortonworks.com>
> > > wrote:
> > > >
> > > > This is a bug, or rather an unexpected usage. I suspect the correct
> > count
> > > > value is coming from statistics.
> > > > Can you file a JIRA?
> > > >
> > > > On 16/8/29, 00:51, "naveen mahadevuni" <nmahadev...@gmail.com>
> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Is the following behavior a bug? I believe at least one part of it
> is
> > a
> > > >> bug. I created two Hive tables at the same location and inserted
> rows
> > in
> > > >> two tables. count(*) returns the correct count for each individual
> > > table,
> > > >> but SELECT * on one tables reads the rows from other table files
> too.
> > > >>
> > > >> CREATE TABLE test1 (col1 INT, col2 INT)
> > > >> stored as orc
> > > >> LOCATION '/apps/hive/warehouse/test1';
> > > >>
> > > >> insert into test1 values(1,2);
> > > >> insert into test1 values(3,4);
> > > >>
> > > >> hive> select count(*) from test1;
> > > >> OK
> > > >> 2
> > > >> Time taken: 0.177 seconds, Fetched: 1 row(s)
> > > >>
> > > >>
> > > >> CREATE TABLE test2 (col1 INT, col2 INT)
> > > >> stored as orc
> > > >> LOCATION '/apps/hive/warehouse/test1';
> > > >>
> > > >> insert into test2 values(1,2);
> > > >> insert into test2 values(3,4);
> > > >>
> > > >> hive> select count(*) from test2;
> > > >> OK
> > > >> 2
> > > >> Time taken: 2.683 seconds, Fetched: 1 row(s)
> > > >>
> > > >> -- SELECT * fetches 4 records where as COUNT(*) above returns count
> of
> > > 2.
> > > >>
> > > >> hive> select * from test2;
> > > >> OK
> > > >> 1   2
> > > >> 3   4
> > > >> 1   2
> > > >> 3   4
> > > >> Time taken: 0.107 seconds, Fetched: 4 row(s)
> > > >> hive> select * from test1;
> > > >> OK
> > > >> 1   2
> > > >> 3   4
> > > >> 1   2
> > > >> 3   4
> > > >> Time taken: 0.054 seconds, Fetched: 4 row(s)
> > > >>
> > > >> Thanks,
> > > >> Naveen
> > > >
> > >
> > >
> >
>


More than one table created at the same location

2016-08-29 Thread naveen mahadevuni
Hi,

Is the following behavior a bug? I believe at least one part of it is a
bug. I created two Hive tables at the same location and inserted rows in
two tables. count(*) returns the correct count for each individual table,
but SELECT * on one tables reads the rows from other table files too.

CREATE TABLE test1 (col1 INT, col2 INT)
stored as orc
LOCATION '/apps/hive/warehouse/test1';

insert into test1 values(1,2);
insert into test1 values(3,4);

hive> select count(*) from test1;
OK
2
Time taken: 0.177 seconds, Fetched: 1 row(s)


CREATE TABLE test2 (col1 INT, col2 INT)
stored as orc
LOCATION '/apps/hive/warehouse/test1';

insert into test2 values(1,2);
insert into test2 values(3,4);

hive> select count(*) from test2;
OK
2
Time taken: 2.683 seconds, Fetched: 1 row(s)

-- SELECT * fetches 4 records where as COUNT(*) above returns count of 2.

hive> select * from test2;
OK
1   2
3   4
1   2
3   4
Time taken: 0.107 seconds, Fetched: 4 row(s)
hive> select * from test1;
OK
1   2
3   4
1   2
3   4
Time taken: 0.054 seconds, Fetched: 4 row(s)

Thanks,
Naveen