Re: Hive Pulsar Integration

2019-04-13 Thread Jörn Franke
I think you need to develop a custom Hive SerDe + a custom Hadoop InputFormat + 
a custom Hive OutputFormat.

> On 12.04.2019, at 17:35, 李鹏辉gmail wrote:
> 
> Hi guys,
> 
> I’m working on integrating Hive and Pulsar and have recently 
> encountered some problems that I hope to get help with here.
> 
> First of all, let me briefly describe the motivation.
> 
> Pulsar can be used as an infinite stream keeping both historic data and 
> streaming data, so we want to use Pulsar as a storage extension for Hive.
> This way, Hive can read the data in Pulsar naturally, and can also write 
> data into Pulsar.
> We would benefit from the same data providing both interactive query and 
> streaming capabilities.
> 
> As an improvement, supporting data partitioning could make queries more 
> efficient (e.g. partitioning by date or any other field). 
> 
> But:
> 
> - How do we get the Hive table's partition definition? 
> - When a user inserts data into a Hive table, how do we determine which 
> partition the data should be stored in? 
> - When a user selects data from a Hive table, how do we determine which 
> partition the data is in?
> 
> If Hive already exposes a mechanism to support this, please show me how to 
> use it.
> 
> Best regards
> 
> Penghui
> Beijing, China
> 
> 
> 
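
A minimal sketch of how such a custom SerDe/InputFormat/OutputFormat combination could be wired into a Hive table definition, including partitioning. All class names and table properties below are hypothetical placeholders, not a published Pulsar connector:

  CREATE EXTERNAL TABLE pulsar_events (
    event_time TIMESTAMP,
    user_id    STRING,
    payload    STRING
  )
  PARTITIONED BY (dt STRING)
  ROW FORMAT SERDE 'org.example.hive.pulsar.PulsarSerDe'
  STORED AS
    INPUTFORMAT  'org.example.hive.pulsar.PulsarInputFormat'
    OUTPUTFORMAT 'org.example.hive.pulsar.PulsarOutputFormat'
  TBLPROPERTIES (
    'pulsar.service.url' = 'pulsar://localhost:6650',
    'pulsar.topic'       = 'persistent://public/default/events'
  );

  -- Partitions must be registered in the metastore before Hive can prune by them:
  ALTER TABLE pulsar_events ADD PARTITION (dt='2019-04-12');

Hive resolves which partitions match a query's WHERE clause against the metastore before it asks the InputFormat for splits, which answers part of the partitioning question above.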


Re: How to Load Data From a CSV to a parquet table

2018-03-01 Thread Jörn Franke
You have defined a Parquet-only table, so Hive interprets your CSV file as Parquet. 
You can, for instance, define two tables:

* one external table for the CSV file
* one table stored as Parquet

Afterwards you select from the first table and insert into the second table. 

> On 1. Mar 2018, at 08:31, Anubhav Tarar  wrote:
> 
> Hi, I'm trying to load data from a CSV file into a Parquet table in Hive, but I
> got this exception:
> 
> hive> create table if not exists REGION( R_NAME string, R_REGIONKEY string,
> R_COMMENT string ) stored as parquet;
> OK
> Time taken: 0.414 seconds
> hive> load data local inpath
> 'file:///home/anubhav/Downloads/dbgen/region.tbl' into table region;
> Loading data to table default.region
> OK
> Time taken: 1.011 seconds
> hive> select * from region;
> OK
> Failed with exception java.io.IOException:java.lang.RuntimeException:
> hdfs://localhost:54311/user/hive/warehouse/region/region.tbl is not a
> Parquet file. expected magic number at tail [80, 65, 82, 49] but found
> [115, 108, 124, 10]
> Time taken: 0.108 seconds
> 
> Can anyone help? The Hive version is 2.1.
> 
> -- 
> Thanks and Regards
> 
> Anubhav Tarar
> Software Consultant, Knoldus Software LLP
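
A minimal sketch of the two-table approach described above, reusing the column names from the question. The '|' field delimiter and the staging location are assumptions (the .tbl file and the bytes in the error message suggest a pipe-delimited file) and may need adjusting:

  -- External table over the raw delimited file:
  CREATE EXTERNAL TABLE region_csv (
    R_NAME      STRING,
    R_REGIONKEY STRING,
    R_COMMENT   STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE
  LOCATION '/user/anubhav/staging/region_csv';  -- hypothetical path

  -- Parquet table, as in the original DDL:
  CREATE TABLE region (
    R_NAME      STRING,
    R_REGIONKEY STRING,
    R_COMMENT   STRING
  )
  STORED AS PARQUET;

  -- LOAD DATA only moves the file; the actual text-to-Parquet conversion
  -- happens in this INSERT ... SELECT:
  INSERT INTO TABLE region SELECT * FROM region_csv;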


Re: Hive Custom Inputformat mapred.* vs. mapreduce.*

2017-09-13 Thread Jörn Franke
OK, MapReduce is the engine, but I am talking about the file format. Even with 
Tez or Spark as the engine, the file format will always be built on the mapred.* 
and/or mapreduce.* APIs.
Aside from Hive, this is also the case in Spark and Flink, which leverage these 
APIs to access different storage systems (S3, HDFS, etc.) and file formats 
transparently.

> On 13. Sep 2017, at 20:53, Alan Gates <alanfga...@gmail.com> wrote:
> 
> I’m not aware of any plans in Hive to do any more work that uses Map Reduce
> as the execution engine, so I expect Hive will continue to use mapred.
> 
> Alan.
> 
>> On Wed, Sep 13, 2017 at 4:25 AM, Jörn Franke <zuinn...@gmail.com> wrote:
>> 
>> Dear all,
>> 
>> I have developed several custom input formats (e.g. for the Bitcoin
>> blockchain), including a HiveSerde; they are open source.
>> I plan to develop a HiveSerde for my HadoopOffice InputFormat as well, but I
>> wonder if I should continue to use the mapred.* APIs or if I should use
>> mapreduce.*.
>> 
>> My input formats support both APIs, but it seems that Hive is one of the
>> last projects (please correct me here) to use the mapred.* API.
>> 
>> Personally, I have no preference, since I support both.
>> 
>> Thank you.
>> 
>> All the best
>> 


Hive Custom Inputformat mapred.* vs. mapreduce.*

2017-09-13 Thread Jörn Franke
Dear all,

I have developed several custom input formats (e.g. for the Bitcoin
blockchain), including a HiveSerde; they are open source.
I plan to develop a HiveSerde for my HadoopOffice InputFormat as well, but I
wonder if I should continue to use the mapred.* APIs or if I should use
mapreduce.*.

My input formats support both APIs, but it seems that Hive is one of the
last projects (please correct me here) to use the mapred.* API.

Personally, I have no preference, since I support both.

Thank you.

All the best


Re: TBLPROPERTIES appears to be ignored by custom inputformats

2016-12-19 Thread Jörn Franke
I assume it is related to lazy evaluation by the SerDe, but this would require 
investigation of the source code and log files. 
If there is no exception in the log files, then you forgot to log them (bad!). 
Use SonarQube or a similar tool to check your source code for those kinds of 
mistakes - it saves debugging and fixing time.
At the same time, your InputFormat should either have default values for its 
configuration or, where that is not possible, throw an exception if the expected 
configuration is not there.

> On 19 Dec 2016, at 07:53, Chris Teoh <chris.t...@gmail.com> wrote:
> 
> Thanks Jorn.
> 
> I don't understand, then, how my SELECT * is correctly reading my table
> property if I'm just using the default SerDe.
>> On Mon., 19 Dec. 2016 at 5:36 pm, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>> You have to write a custom Hive SerDe to pass TBLPROPERTIES as
>> InputFormat properties, but check the source code of the SerDe you used.
>> 
>>> On 19 Dec 2016, at 07:22, Chris Teoh <chris.t...@gmail.com> wrote:
>>> 
>>> rows.
>> 


Re: TBLPROPERTIES appears to be ignored by custom inputformats

2016-12-18 Thread Jörn Franke
You have to write a custom Hive SerDe to pass TBLPROPERTIES as 
InputFormat properties, but check the source code of the SerDe you used.

> On 19 Dec 2016, at 07:22, Chris Teoh  wrote:
> 
> rows.


Re: TBLPROPERTIES appears to be ignored by custom inputformats

2016-12-18 Thread Jörn Franke
What is the create table statement? Do you parse the tblproperties in the 
HiveSerde? Do you have exceptions in the log?

> On 19 Dec 2016, at 07:02, Chris Teoh  wrote:
> 
> Hi there,
> 
> Can anyone confirm whether TBLPROPERTIES in DDLs are ignored by custom
> inputformats in the context of a UDAF?
> 
> I've written a custom input format and it works with a SELECT * but when I
> do anything more like SELECT count(*) it returns 0.
> 
> INSERT INTO  SELECT * FROM  doesn't
> appear to insert anything into the table.
> 
> I found I had to use the "set" commands to make things work rather than use
> DDL. This doesn't appear to make sense if the setting is specific to the
> table and would cause problems if I was working with more than one table
> and needed different values for the same setting.
> 
> Kind regards
> Chris
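
To make the symptom concrete, a sketch of the two configuration paths Chris describes; the property name, table layout, and InputFormat class are hypothetical:

  -- Table-level property: Hive hands TBLPROPERTIES to the SerDe's initialize(),
  -- but, as observed in this thread, a custom InputFormat will not necessarily
  -- see it when a MapReduce job is launched.
  CREATE TABLE events (line STRING)
  STORED AS
    INPUTFORMAT  'org.example.MyInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  TBLPROPERTIES ('my.custom.batch.size' = '1000');

  -- Session-level workaround from the thread: a plain "set" lands in the job
  -- configuration, so the InputFormat can read it, but it applies to the whole
  -- session rather than to one table.
  set my.custom.batch.size=1000;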


Re: Load performance with partitioned table

2016-09-15 Thread Jörn Franke
What is your hardware setup?
Are the bloom filters necessary on all columns? Usually they only make sense 
for non-numeric columns. Updating bloom filters takes time and should be avoided 
where they do not make sense.
Can you provide an example of the data and the select queries that you execute 
on them?
Do you use compression on the tables? If so which?
What are the exact times and data volumes?

> On 15 Sep 2016, at 19:56, naveen mahadevuni  wrote:
> 
> Hi,
> 
> I'm using the ORC format for our table storage. The table has a timestamp
> column (say TS) and 25 other columns. The other ORC properties we are using
> are the storage index and bloom filters. We are loading 100 million records
> into this table on a 4-node cluster.
> 
> Our source table is a text table in CSV format. In the source table,
> timestamp values come as BIGINT. In the INSERT SELECT, we use the function
> "from_unixtime(sourceTable.TS)" to convert the BIGINT values to timestamps
> in the target ORC table. So the first INSERT SELECT into a non-partitioned
> table looks like this:
> 
> 1) INSERT INTO TARGET SELECT from_unixtime(ts), col1, col2... from SOURCE.
> 
> I wanted to test by partitioning the table by date derived from this
> timestamp, so I used "to_date(from_unixtime(TS))" in the new INSERT SELECT
> with dynamic partitioning. The second one is
> 
> 2) INSERT INTO TARGET PARTITION(datecol) SELECT from_unixtime(ts), col1,
> col2... to_date(from_unixtime(ts)) as datecol from SOURCE.
> 
> The load time increased by 50% from 1 to 2. I understand the second
> statement involves creating many more partition directories and files.
> 
> Is there any way we can improve the load time? In the second INSERT SELECT,
> will the result of the expression "from_unixtime(ts)" be reused in
> "to_date(from_unixtime(ts))"?
> 
> Thanks,
> Naveen
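
A minimal sketch of restricting bloom filters to selected columns instead of creating them for every column, as suggested above. Column choices and property values are illustrative only and have to be adapted to the real schema:

  CREATE TABLE target (
    ts   TIMESTAMP,
    col1 STRING,
    col2 STRING
    -- ... remaining columns ...
  )
  PARTITIONED BY (datecol STRING)
  STORED AS ORC
  TBLPROPERTIES (
    -- bloom filters only on columns actually used for point lookups:
    'orc.bloom.filter.columns' = 'col1,col2',
    'orc.bloom.filter.fpp'     = '0.05',
    'orc.create.index'         = 'true'
  );

Skipping bloom filters on purely numeric measure columns reduces the work done per written row during the INSERT.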


Analyzing Bitcoin blockchain data with Hive

2016-04-29 Thread Jörn Franke
Dear all,

I prepared a small Serde to analyze Bitcoin blockchain data with Hive:
https://snippetessay.wordpress.com/2016/04/28/hive-bitcoin-analytics-on-blockchain-data-with-sql/

There are some example queries, and I will add more in the future.
Additionally, more unit tests will be added.

Let me know if this is useful for you and of course please report bugs ;)

Thank you.

Cheers


Re: Reviews & commits (RTC/CTR), contributions, bylaws

2016-04-11 Thread Jörn Franke
Hi Lars,

I think this is a valid concern and your proposal sounds good to me. 

Best regards

> On 11 Apr 2016, at 15:38, Lars Francke  wrote:
> 
> Hi,
> 
> I've been a long-time contributor to Hive (5 or so years) and have been
> voted in as a committer and I'm very grateful for that. I also understand
> that my situation is different from that of most committers, as I'm not
> working for one of the big companies (Facebook, Cloudera, Hortonworks etc.)
> where you can just ask someone sitting next to you to do a review.
> 
> I'd really like to contribute more than I do currently but the process of
> getting patches in is painful for me (and other 'outside' contributors) as
> it is hard to get reviews & things committed. The nature of most of my
> patches is very minor[1] (fixing typos, checkstyle issues etc.) and I
> understand that these are not the most interesting patches to review and
> are easy to miss. I don't blame anyone for this situation as I totally
> understand it and have been on the other side of this for other projects.
> 
> Is there anything we can do to make it easier for me and others like me to
> contribute here? I absolutely see the value in having "cleaner" code and
> when done in small batches it's usually not very disruptive either.
> 
> The bylaws currently require a +1 from a committer who has not authored the
> patch. Knox for example has a different policy [2] where they distinguish
> between major features and minor things which can be committed freely.
> 
> Hive could adopt something similar, or some middle ground. These are just
> two suggestions:
> 
> 1) Allow minor changes (at the committer's discretion) without requiring
> an extra +1
> 2) Allow minor changes (at the committer's discretion) with lazy approval
> (i.e. wait 24 hours)
> 
> Sorry for the long rant but I'd love some feedback on this and am looking
> forward to contributing more in the future.
> 
> Cheers,
> Lars
> 
> [1] e.g. 
> [2] 


Re: analyse command not working on decimal(38,0) datatype

2016-04-06 Thread Jörn Franke
Please provide the exact log messages, CREATE TABLE statements, and INSERT statements.

> On 06 Apr 2016, at 12:05, Ashim Sinha  wrote:
> 
> Hi Team,
> I need help with the following issue. Steps followed:
> Created the table
> Loaded data of length 38 into the decimal(38,0) column
> ANALYZE TABLE ... FOR COLUMNS gives an error like "Zero length BigInteger"
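
A self-contained repro sketch along the lines Jörn asks for; the table and column names are made up, and the literal is a 38-digit value:

  CREATE TABLE dec_test (c DECIMAL(38,0));
  INSERT INTO dec_test
    VALUES (CAST('99999999999999999999999999999999999999' AS DECIMAL(38,0)));
  ANALYZE TABLE dec_test COMPUTE STATISTICS FOR COLUMNS;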


Re: Run hive in debug mode

2015-09-06 Thread Jörn Franke
There must be some log4j configuration file in the hive configuration
folder where you can do this.

On Mon, 7 Sep 2015 at 05:09, Chetna C  wrote:

> Hi All,
>    I am new to the Hive community and recently started going through the code.
> At some points I would like to see the execution flow by enabling debug
> mode, but I am not sure how to achieve this.
>   What I am doing is running hive.server2.logging.operation.level=VERBOSE
> inside the Hive CLI, but with that the log file contains only INFO statements.
> I also tried compiling the code with -Djavac.debug=on; that didn't work for me.
>   Please help if you know how to enable debug mode.
>
>
> Thanks,
> Chetna Chaudhari.
>
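
For reference, a sketch of the two usual knobs Jörn alludes to, assuming a Hive 1.x install (file and property names should be double-checked against your distribution):

  # Per session: start the CLI with debug logging to the console
  hive --hiveconf hive.root.logger=DEBUG,console

  # Permanently: in conf/hive-log4j.properties (copy from the .template if absent)
  hive.root.logger=DEBUG,DRFA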


Re: HIVE:1.2, Query taking huge time

2015-08-20 Thread Jörn Franke
Additionally, although it is a PoC, you should have a realistic data model.
Furthermore, good data modeling practices should be taken into account.
Joining on a double is not one of them; the join key should be an int.
Also, double is a type that is rarely appropriate in most scenarios. In the
business world you usually have something like decimal(precision,scale),
where you define precision and scale exactly (usually x,2 for everything
related to money). The reason is that rounding needs to be consistent
across all analytics applications.
You may then, of course, also partition the table.
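
As a sketch of what that means for the table in the quoted thread below (names, precision, and scale are illustrative, not a definitive schema):

  CREATE TABLE huge_numeric_table_orc3 (
    id     INT,            -- surrogate join key instead of joining on a double
    amount DECIMAL(18,2),  -- money-like measures with a fixed scale
    col2   DECIMAL(18,2)
    -- ... remaining measure columns ...
  )
  PARTITIONED BY (load_date STRING)
  CLUSTERED BY (id) SORTED BY (id) INTO 240 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY');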

On Thu, 20 Aug 2015 at 15:46, Xuefu Zhang xzh...@cloudera.com wrote:

 Please check out HIVE-11502. For your PoC, you can simply get around it by
 using other data types instead of double.

 On Thu, Aug 20, 2015 at 2:08 AM, Nishant Aggarwal nishant@gmail.com
 wrote:

  Thanks for the reply, Noam. I have already tried the latter point of
  dividing the query, but the challenge comes when joining the
  tables.
 
 
  Thanks and Regards
  Nishant Aggarwal, PMP
  Cell No:- +91 99588 94305
 
 
  On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson noam.has...@kenshoo.com
  wrote:
 
  Hi,
 
  Have you looked at the counters on the Hadoop side? It's possible you are
  dealing with a bad join which causes multiplication of items; if you see a
  huge number of input/output records in the map/reduce phases that keeps
  increasing, that's probably the case.
 
  Another thing I would try is to divide the job into several different
  smaller queries, for example starting with the filter only, after that the
  join, and so on.
 
  Noam.
 
  On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal 
 nishant@gmail.com
   wrote:
 
  Dear Hive Users,
 
  I am in the process of running a PoC for one of my customers,
  demonstrating the huge performance benefits of Hadoop big data using
  Hive.
 
  Following is the problem statement i am stuck with.
 
  I have generated a large table with 28 columns (all double). The table
  size on disk is 70GB (I ultimately created a compressed table using the ORC
  format to save disk space, bringing the table size down to 1GB) with more
  than 450 million records.
 
  In order to demonstrate a complex use case, I joined this table with
  itself. Following are the queries I used to create the table and the join
  query I am using.
 
  Create Table and Loading Data, Hive parameter settings:
  set hive.vectorized.execution.enabled = true;
  set hive.vectorized.execution.reduce.enabled = true;
  set mapred.max.split.size=1;
  set mapred.min.split.size=100;
  set hive.auto.convert.join=false;
  set hive.enforce.sorting=true;
  set hive.enforce.bucketing=true;
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  set mapreduce.reduce.input.limit=-1;
  set hive.exec.parallel = true;
 
  CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
  double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
  double,col10 double,col11 double,col12 double,col13 double,col14
  double,col15 double,col16 double,col17 double,col18 double,col19
  double,col20 double,col21 double,col22 double,col23 double,col24
  double,col25 double,col26 double,col27 double,col28 double)
  clustered by (col1) sorted by (col1) into 240 buckets
  STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
 
  from huge_numeric_table insert overwrite table huge_numeric_table_orc2
  select * sort by col1;
 
 
  JOIN QUERY:
 
  select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as
 AVG5
  from huge_numeric_table_orc2 t1 left outer join
 huge_numeric_table_orc2 t2
  on t1.col1=t2.col1 where (t1.col1)  34.11 and (t2.col1) 10.12
 
 
  The problem is that this query gets stuck at the reducers (80-85%), goes
  into a loop, and never finishes.
 
  Version of Hive is 1.2.
 
  Please help.
 
 
  Thanks and Regards
  Nishant Aggarwal, PMP
  Cell No:- +91 99588 94305
 
 
 
 
 
 



Re: [jira] [Created] (HIVE-11312) ORC format: where clause with CHAR data type not returning any rows

2015-07-19 Thread Jörn Franke
Is this also the case for varchar?

On Mon, 20 Jul 2015 at 01:14, Thomas Friedrich (JIRA) j...@apache.org wrote:

 Thomas Friedrich created HIVE-11312:
 ---

  Summary: ORC format: where clause with CHAR data type not
 returning any rows
  Key: HIVE-11312
  URL: https://issues.apache.org/jira/browse/HIVE-11312
  Project: Hive
   Issue Type: Bug
   Components: Query Processor
 Affects Versions: 1.2.0, 1.2.1
 Reporter: Thomas Friedrich
 Assignee: Thomas Friedrich


 Test case:
 Setup:
 create table orc_test( col1 string, col2 char(10)) stored as orc
  tblproperties ("orc.compress"="NONE");
 insert into orc_test values ('val1', '1');
 Query:
 select * from orc_test where col2='1';

 Query returns no row.

  The problem was introduced by HIVE-10286, in class RecordReaderImpl.java,
  method evaluatePredicateRange.
 Old code:
 - Object baseObj = predicate.getLiteral(PredicateLeaf.FileFormat.ORC);
 - Object minValue = getConvertedStatsObj(min, baseObj);
 - Object maxValue = getConvertedStatsObj(max, baseObj);
 - Object predObj = getBaseObjectForComparison(baseObj, minValue);

 New code:
 + Object baseObj = predicate.getLiteral();
 + Object minValue = getBaseObjectForComparison(predicate.getType(), min);
 + Object maxValue = getBaseObjectForComparison(predicate.getType(), max);
 + Object predObj = getBaseObjectForComparison(predicate.getType(),
 baseObj);

  The values for min and max are of type String and contain as many
  characters as the CHAR column indicates. For example, if the type is
  CHAR(10) and the row has the value '1', the String min is '1' padded with
  trailing spaces to 10 characters.

  Before Hive 1.2, the method getConvertedStatsObj would call
  StringUtils.stripEnd(statsObj.toString(), null), which would remove the
  trailing spaces from min and max. Later, in the compareToRange method, it
  was able to compare '1' with '1'.

  In Hive 1.2, with the new getBaseObjectForComparison method, it simply
  returns obj.toString() if the data type is String, which means minValue and
  maxValue are still the space-padded '1'.
  As a result, the compareToRange method returns a wrong value
  ('1'.compareTo('1         ') = -9) instead of 0.



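A quick way to check the varchar question would be to repeat the reporter's test case with a VARCHAR column (a sketch only, not verified against a specific Hive build):

  create table orc_test_vc (col1 string, col2 varchar(10)) stored as orc
    tblproperties ("orc.compress"="NONE");
  insert into orc_test_vc values ('val1', '1');
  select * from orc_test_vc where col2='1';
  -- If the row comes back, the padding issue is specific to CHAR (VARCHAR
  -- values are not space-padded); if it does not, varchar is affected as well.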



Re: run tests failed when building hive query language on hadoop 2.7.0

2015-05-25 Thread Jörn Franke
Hi 韦,
You need to be a little bit more precise. What is the error message? What is
your setup (OS, JDK version, ...)?

Thank you.

Best regards
On 25 May 2015 at 09:11, 煜 韦 yu20...@hotmail.com wrote:

  Hi,
  When I'm building Hive 1.2.0 on Hadoop 2.7.0, the hive query language module
  fails while running its tests during the build. What should I do to fix this
  problem?
  Thanks,
  Jared