Re: Can Hive bear high-throughput streaming data ingest?

2020-03-20 Thread Jörn Franke
Why don’t you write them directly to local storage first and then move them all to HDFS? Then you can create an external table in Hive on them and do analyses. > On 20.03.2020 at 08:30, "wangl...@geekplus.com.cn" wrote:
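A minimal sketch of that pattern (path and column are illustrative; the files are batched into the HDFS directory first):

    CREATE EXTERNAL TABLE raw_events (line STRING)
    LOCATION 'hdfs:///data/ingest/';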

Re: Less than(<) & Greater than(>) condition failing on string column but min/max is working

2019-08-20 Thread Jörn Franke
You could just move the ints outside the Map. Alternatively you can convert the string to int: cast(strcolumn as int). See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions > On 12.08.2019 at 21:41, Anup Tiwari wrote: > > Hi All, >
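For example, assuming strcolumn holds numeric text (the table name is illustrative):

    SELECT * FROM t WHERE CAST(strcolumn AS INT) > 100;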

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Jörn Franke
Do you use the HiveContext in Spark? Do you configure the same options there? Can you share some code? > On 07.08.2019 at 08:50, Rishikesh Gawade wrote: > > Hi. > I am using Spark 2.3.2 and Hive 3.1.0. > Even if I use parquet files the result would be the same, because after all > sparkSQL

Re: Converting Hive Column from Varchar to String

2019-07-18 Thread Jörn Franke
You have to create a new table with this column as string and do an insert ... select from the old table. > On 18.07.2019 at 01:14, William Shen wrote: > > Hi all, > > I assumed that it should be compatible to convert column type varchar to > string, however, after running ALTER TABLE table
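A minimal sketch of that approach (column names are illustrative):

    CREATE TABLE new_table (id INT, name STRING);
    INSERT INTO TABLE new_table SELECT id, name FROM old_table;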

Re: Out Of Memory Error

2019-01-09 Thread Jörn Franke
Which Hive version and engine? If it is Tez then you can also try MR as an engine (set hive.execution.engine=mr); that will use less memory. Check also the max heap space configuration on the nodes. Maybe you have physically 16 GB of memory but the Java process is allowed only 4 or so. Maybe
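For example (the memory values are illustrative and depend on the cluster):

    SET hive.execution.engine=mr;
    SET mapreduce.map.memory.mb=4096;
    SET mapreduce.map.java.opts=-Xmx3276m;
    SET mapreduce.reduce.memory.mb=4096;
    SET mapreduce.reduce.java.opts=-Xmx3276m;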

Re: Dynamic partition pruning

2018-12-19 Thread Jörn Franke
Can you please provide us more details: number of rows in each table and per partition, the table structure, Hive version, table format, is the table sorted or partitioned on dt? Why don’t you use a join, potentially with a mapjoin hint? > On 19.12.2018 at 09:02, Prabhakar Reddy wrote: > >
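A sketch of a join with a mapjoin hint (table and column names are illustrative; the hint asks Hive to load the small table into memory):

    SELECT /*+ MAPJOIN(d) */ f.*
    FROM fact f JOIN dim d ON (f.dt = d.dt);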

Re: [feature request] auto-increment field in Hive

2018-09-16 Thread Jörn Franke
done on it in a long time. A > simple UDF isn't capable of providing true unique sequence support. > > Thanks > Shawn > > -Original Message- > From: Jörn Franke > Sent: Saturday, September 15, 2018 6:09 AM > To: user@hive.apache.org > Subject: Re: [featur

Re: [feature request] auto-increment field in Hive

2018-09-15 Thread Jörn Franke
If you really need it then you can write a UDF for it. > On 15. Sep 2018, at 11:54, Nicolas Paris wrote: > > Hi > > Hive does not provide auto-increment columns (=sequences). Is there any > chance that feature will be provided in the future ? > > This is one of the highest limitation in
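As a workaround without a UDF, a window function can generate per-query ids (not a true persistent sequence; names are illustrative):

    SELECT ROW_NUMBER() OVER (ORDER BY s.src_id) AS seq, s.* FROM staging s;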

Re: Improve performance of Analyze table compute statistics

2018-08-26 Thread Jörn Franke
You can partition it and only compute statistics for new partitions... > On 26. Aug 2018, at 12:43, Prabhakar Reddy wrote: > > Hello, > > Are there any properties that I can set to improve the performance of Analyze > table compute statistics statement. My data sits in s3 and I see it's taking
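For example, limiting statistics collection to a new partition (table and partition names are illustrative):

    ANALYZE TABLE sales PARTITION (ds='2018-08-26') COMPUTE STATISTICS;
    ANALYZE TABLE sales PARTITION (ds='2018-08-26') COMPUTE STATISTICS FOR COLUMNS;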

Re: Enabling Snappy compression on Parquet

2018-08-22 Thread Jörn Franke
No, Parquet and ORC have internal compression, which should be used instead of the external compression you are referring to. Internal compression can be decompressed in parallel, which is significantly faster. Internally Parquet supports only snappy, gzip, lzo, brotli (2.4), lz4 (2.4), zstd (2.4).
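A sketch of enabling internal compression via table properties (table layout is illustrative):

    CREATE TABLE t_parquet (id INT, payload STRING)
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression'='SNAPPY');

    CREATE TABLE t_orc (id INT, payload STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='SNAPPY');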

Re: Reply: Does Hive 3.0 only work with hadoop3.x.y?

2018-07-22 Thread Jörn Franke
Hadoop 3.0 brings some interesting benefits anyway, such as reduced storage needs (you don't need to replicate 3 times anymore for reliability reasons), so that may be convincing. > On 22. Jul 2018, at 08:28, 彭鱼宴 <461292...@qq.com> wrote: > > Hi Tanvi, > > Thanks! I will check that and have a

Re: Error Starting hive thrift server, hive 3 on Hadoop 3.1

2018-07-03 Thread Jörn Franke
It would be good if you could document this on the Hive wiki so that other users know about it. On the other hand there is Apache Bigtop, which tests the integration of various big data components - but it is complicated. Behind a big data distribution there is a lot of effort. > On 3. Jul 2018, at

Re: Which version of Hive can handle creating XML table?

2018-06-09 Thread Jörn Franke

Re: Which version of Hive can handle creating XML table?

2018-06-09 Thread Jörn Franke
>>> On 8 June 2018 at 21:56, Jörn Franke wrote: >>> Oh, I see now that Serde is a deprecated interface, if I am not wro

Re: Which version of Hive can handle creating XML table?

2018-06-08 Thread Jörn Franke
META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.properties > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > Disclaimer: Use it at you

Re: Which version of Hive can handle creating XML table?

2018-06-08 Thread Jörn Franke
Can you get the log files and start Hive with more detailed logging? It could be that not all libraries are loaded (I don’t remember anymore, but I think this one needs more; I can look next week in my docs) or that it does not support maps (not sure). You can try first with a simpler
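One way to get more detailed logs is to start the CLI with a debug logger (a sketch; the exact mechanism depends on the Hive version and setup):

    hive --hiveconf hive.root.logger=DEBUG,console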

Re: What does the ORC SERDE do

2018-05-13 Thread Jörn Franke
d point me in > the direction of anything I’ve missed. > > Thanks, > > Elliot. > >> On Sun, 13 May 2018 at 15:42, Jörn Franke <jornfra...@gmail.com> wrote: >> In detail you can check the source code, but a Serde needs to translate an >> object to a Hiv

Re: What does the ORC SERDE do

2018-05-13 Thread Jörn Franke
May 2018, at 17:08, 侯宗田 <zongtian...@icloud.com> wrote: > > Thank you, it makes the concept clearer to me. I think I need to look up the > source code for some details. >> On 13 May 2018, at 22:42, Jörn Franke <jornfra...@gmail.com> wrote: >> >> In detail you ca

Re: What does the ORC SERDE do

2018-05-13 Thread Jörn Franke
In detail you can check the source code, but a SerDe needs to translate an object to a Hive object and vice versa. Usually this is very simple (simply passing the object through or creating a HiveDecimal, etc.). It also provides an ObjectInspector that basically describes an object in more detail (eg to be
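For illustration, STORED AS ORC is shorthand for naming the SerDe and the input/output formats explicitly (table layout is illustrative):

    CREATE TABLE t_orc (id INT, name STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';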

Re: Need to read JSON File

2018-04-22 Thread Jörn Franke
Check the JSON SerDe: https://cwiki.apache.org/confluence/display/Hive/SerDe > On 22. Apr 2018, at 09:09, Mahender Sarangam > wrote: > > Hi, > > we have to read a Gz compressed JSON file from the source system. I see there are 3 > different ways of reading JSON data.
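A sketch using the HCatalog JSON SerDe (requires the hive-hcatalog-core jar; table layout and path are illustrative; .gz text files under the location are decompressed transparently):

    CREATE EXTERNAL TABLE events_json (id BIGINT, name STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION '/data/events/';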

Re: Business Rules Engine for Hive

2018-04-16 Thread Jörn Franke
, I was thinking to give a custom UI. > > Next read from UI data and build UDFs using the rules defined outside the UDF. > > 1 UDF per data object. > > Not sure these are just thoughts. > >> On Mon, Apr 16, 2018 at 1:40 PM Jörn Franke <jornfra...@gmail.com> wrote

Re: Business Rules Engine for Hive

2018-04-16 Thread Jörn Franke
I would not use Drools with Spark; it does not scale to the distributed setting. You could translate the rules to Hive queries, but this would not be exactly the same thing. > On 16. Apr 2018, at 17:59, Joel D wrote: > > Hi, > > Any suggestions on how to implement

Re: ODBC-hiveserver2 question

2018-02-24 Thread Jörn Franke
HDFS support depends on the version. For a long time it was not supported. > On 23. Feb 2018, at 21:08, Andy Srine wrote: > > Team, > > Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;) > supported in hiveserver2 via an ODBC connection? > > Some

Re: ODBC-hiveserver2 question

2018-02-23 Thread Jörn Franke
ADD JAR works only with local files on the Hive server. > On 23. Feb 2018, at 21:08, Andy Srine wrote: > > Team, > > Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;) > supported in hiveserver2 via an ODBC connection? > > Some relevant points: >
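For example (the path is illustrative; it must exist on the HiveServer2 host, not on the client):

    ADD JAR /opt/hive/aux-jars/hive-contrib-2.1.1.jar;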

Re: Question on accessing LLAP as data cache from external containers

2018-01-29 Thread Jörn Franke
Are you looking for something like this: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html To answer your original question: why not implement the whole job in Hive? Or orchestrate some parts in MR and some in Hive using Oozie. > On 30. Jan 2018,

Re: HIVE Parquet column names issue

2018-01-26 Thread Jörn Franke
Drop the old Parquet table first and then create it with explicit statements. The above statement keeps using the old Parquet table if it existed. > On 26. Jan 2018, at 17:35, Brandon Cooke wrote: > > Hi Prasad, > > I actually have tried this and I had that same

Re: Does Hive SQL support reading data without locking?

2018-01-01 Thread Jörn Franke
How do you import data? Bulk import? What about using partitions (or is the data too small for daily partitions?) > On 2. Jan 2018, at 04:59, l...@china-inv.cn wrote: > > Hi, All, > > We are using Hive to persist our data and we run cron jobs to import new data > into Hive daily. > > At

Re: For Apache Hive HS2, what is the largest heap size setting that works well?

2017-11-28 Thread Jörn Franke
I also recommend it; you will also see performance improvements with JDK 8 in general (use the latest version). Keep in mind as well that more and more big data libraries etc. will drop JDK 7 support soon (aside from the fact that JDK 7 is not maintained anymore anyway). > On 29. Nov 2017, at 01:31, Johannes

Re: Issues with hive storage based authorization

2017-11-15 Thread Jörn Franke
orization. So not sure how Hive > storage based authorization will provide additional security. Definitely I am > missing something. > > Please suggest. > > Thanks, > Vijay > >> On Thu, Nov 9, 2017 at 1:55 PM, Jörn Franke <jornfra...@gmail.com> wrote: >>

Re: Issues with hive storage based authorization

2017-11-09 Thread Jörn Franke
Then you need to kerberize it to support what you want > On 9. Nov 2017, at 09:18, Vijay Toshniwal <vijay.toshni...@gmail.com> wrote: > > No its not. > > Thanks, > Vijay > >> On Thu, Nov 9, 2017 at 1:09 PM, Jörn Franke <jornfra...@gmail.com> wrot

Re: Issues with hive storage based authorization

2017-11-08 Thread Jörn Franke
Is your Hadoop cluster kerberized? > On 9. Nov 2017, at 06:57, Vijay Toshniwal wrote: > > Hi Team, > > > > I am facing issues while configuring hive storage based authorization. I > followed the steps mentioned in >

Re: Parameterized views

2017-10-02 Thread Jörn Franke
Yes, it is better to push columns to the view so that users can filter on them. Alternatively you can use HPL/SQL on Hive. I think there are very few (if any) use cases that require parameterized views. > On 2. Oct 2017, at 16:12, Elliot West wrote: > > Hello, > > Does any
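A sketch of pushing a column into the view so it can be filtered like a parameter (names are illustrative):

    CREATE VIEW sales_v AS
    SELECT region, amount, sale_date FROM sales;

    SELECT SUM(amount) FROM sales_v WHERE sale_date = '2017-10-01';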

Re: hive on spark - why is it so hard?

2017-10-02 Thread Jörn Franke
You should try TEZ+LLAP. Additionally you will need to compare different configurations. Finally, any generic comparison is meaningless: you should use the queries, data and file formats that your users will be using later. > On 2. Oct 2017, at 03:06, Stephen Sprague wrote: > >

Re: Hive - Avro - Schema Manipulation

2017-09-24 Thread Jörn Franke
insert into dep_av values(8,null) should do what you intend. > On 24. Sep 2017, at 03:03, BD wrote: > > Hi , > > I have imported (using sqoop) departments table from retail_db in hdfs as > avro file. Have created an external table stored as hive and used the avro >

Re: EMR 5.8 & Hue/Hive Performance/Stability Specifics

2017-09-12 Thread Jörn Franke
Test it, because it really depends on what you do. Since you use Hue you seem to be interested in interactive analysis, so the best is to use Tez and LLAP as the Hive engine. Also make sure that you use ORC or Parquet as the Hive storage format. Leverage the built-in ORC or Parquet indexes by sorting
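A sketch of a sorted, bucketed ORC table for interactive queries (names and bucket count are illustrative):

    CREATE TABLE events_orc (user_id BIGINT, ts TIMESTAMP, payload STRING)
    CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC;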

Re: Why is a single INSERT very slow in Hive?

2017-09-11 Thread Jörn Franke
Why do you want to do single inserts? Hive has been designed more for bulk loads. In any case, newer versions of Hive 2 using TEZ+LLAP improve it significantly (also for bulk analysis). Nevertheless, it is good practice not to use single inserts in an analytics system, but to try to combine and
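For instance, batching several rows into one statement instead of issuing one INSERT per row (names are illustrative):

    INSERT INTO TABLE events VALUES (1, 'a'), (2, 'b'), (3, 'c');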

Re: Hive index + Tez engine = no performance gain?!

2017-08-21 Thread Jörn Franke
Parquet also has internal indexes, so no need for a Hive index there. For fast ad-hoc queries you can use Tez+LLAP. Here you could use Parquet or convert easily to ORC via CTAS. However you need to check whether ORC is faster than Parquet depending on your data, queries and configuration (bloom
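A sketch of an ORC table with a bloom filter on a lookup column (names are illustrative):

    CREATE TABLE t_orc (id BIGINT, val STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.bloom.filter.columns'='id');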

Re: Can one classify Hive as an analytical tool besides storage?

2017-08-14 Thread Jörn Franke
Depends on your definition of analytical and storage tool. I think Hive (especially with TEZ+LLAP) would qualify as an analytical tool, especially because you can extend it with SQL procedures, can use any Java function directly in SQL, it has a wide range of analytical functions, etc. The

Re: Using JAR files located on HDFS for SerDe

2017-04-12 Thread Jörn Franke
I do not think it is supported. The jar for Hive must be on a local filesystem of the Hive server (not necessarily on all nodes). > On 12. Apr 2017, at 16:57, Mahdi Mohammadinasab wrote: > > Hello, > > I am trying to add a JAR file which is located on HDFS to be later used

Re: Using Sqoop to get data from Impala/Hive to another Hive table

2017-02-21 Thread Jörn Franke
>> On 21 February 2017 at 10:26, Jörn Franke <jo

Re: Using Sqoop to get data from Impala/Hive to another Hive table

2017-02-21 Thread Jörn Franke
Hello, I have not tried it, but Sqoop supports any JDBC driver. However, since SQL syntax is not fully standardized you may face issues or performance problems. Hive itself has a nice import and export tool that also supports metadata import/export. It can be orchestrated from
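The built-in tool referred to is EXPORT/IMPORT, which also carries the metadata (table name and path are illustrative):

    EXPORT TABLE orders TO '/tmp/orders_export';
    -- after copying the export directory to the target cluster:
    IMPORT TABLE orders FROM '/tmp/orders_export';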

Re: Need help on connecting Tableau to Hive

2017-01-22 Thread Jörn Franke
You need to install the Hortonworks ODBC drivers (just Google them) on Windows. Tableau does not include any drivers, only software that uses these drivers to connect to whatever database you need. > On 22 Jan 2017, at 03:15, Raymond Xie wrote: > > Hello, > > I have

Re: File not found of TEZ libraries with tez.lib.uris configuration

2017-01-16 Thread Jörn Franke
Sorry, never mind my previous mail... in the stack trace it seems to look exactly for this file. Can you try to download the file? Can you check if these are all the files needed? I think you need to extract the .tar.gz and point to the jars (check the Tez web site for the config). > On 17 Jan 2017, at

Re: File not found of TEZ libraries with tez.lib.uris configuration

2017-01-16 Thread Jörn Franke
Maybe the wrong configuration file is picked up? > On 17 Jan 2017, at 07:44, wenxing zheng wrote: > > Dear all, > > I met an issue in the TEZ configuration for HIVE, as from the HIVE logs file: > >> Caused by: java.io.FileNotFoundException: File does not exist: >>

Re: Hive Stored Textfile to Stored ORC taking long time

2016-12-09 Thread Jörn Franke
> We take the files once a day, so if I put them in a textfile table and then convert to ORC > it will take me almost half a day just to display the data. > > It is basically a time-consuming task, and I want to do it much quicker. A better > solution of course would be to put smaller files with F

Re: Hive Stored Textfile to Stored ORC taking long time

2016-12-09 Thread Jörn Franke
How large is the file? Might IO be an issue? How many disks do you have on the single node? Do you compress the ORC (snappy?)? What is the Hadoop distribution? Configuration baseline? Hive version? Not sure if I understood your setup, but might the network be an issue? > On 9 Dec 2016, at 02:08,
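A sketch of converting a text table to compressed ORC in one step (names are illustrative):

    CREATE TABLE logs_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='SNAPPY')
    AS SELECT * FROM logs_text;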

Re: s3a and hive

2016-11-14 Thread Jörn Franke
Is it a permission issue on the folder? > On 15 Nov 2016, at 06:28, Stephen Sprague wrote: > > so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive and > see what would happen. > > running this query: > > insert overwrite table

Re: Hive metadata on Hbase

2016-10-23 Thread Jörn Franke
I think the main gain is more about getting rid of a dedicated database including maintenance and potential license cost. For really large clusters and a lot of users this might be even more beneficial. You can avoid clustering the database etc. > On 24 Oct 2016, at 00:46, Mich Talebzadeh

Re: Query consuming all resources

2016-09-28 Thread Jörn Franke
You need to configure queues in YARN and use the fair scheduler. From your use case it looks like you also need to configure preemption. > On 28 Sep 2016, at 00:52, Jose Rozanec wrote: > > Hi, > > We have a Hive cluster. We notice that some queries consume all

Re: populating Hive table periodically from files on HDFS

2016-09-25 Thread Jörn Franke
I think what you propose makes sense. With a delta load you would not gain much performance (most likely you would lose performance, because you need to figure out what has changed, you have the typical issues of distributed systems that some changes may arrive later, error

Re: need help with hive session getting killed

2016-09-20 Thread Jörn Franke
Increase the timeout, or write the result of the query to a dedicated table. > On 20 Sep 2016, at 16:57, anup ahire wrote: > > > > > Hello, > > I am using hive-jdbc-1.2.1 to run a query. The query runs around an hour and > eventually completes. > But my hive session
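A sketch of the second option, materializing the result so the session does not have to stay open (names are illustrative):

    CREATE TABLE result_batch AS
    SELECT customer_id, COUNT(*) AS cnt
    FROM events
    GROUP BY customer_id;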

Re: Hive 2.x usage

2016-09-14 Thread Jörn Franke
If you are using a distribution (which you should if you go to production - Apache releases should not be used due to the maintainability, complexity and interaction with other components, such as Hadoop etc.) then wait until a distribution with 2.x is out. As far as I am aware there is

Re: Hive or Pig - Which one gives best performance for reading HBase data

2016-09-14 Thread Jörn Franke
They should be rather similar; you may gain some performance using Tez or Spark as an execution engine, but in an export scenario do not expect much improvement. In any scenario avoid having only one reducer; use several, e.g. by exporting to multiple output files instead
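For example, on the MR engine the number of reducers (and thus output files) can be raised explicitly (the value is illustrative):

    SET mapreduce.job.reduces=8;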

Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables

2016-08-24 Thread Jörn Franke
6 3:07 AM, "Joel Victor" <joelsvic...@gmail.com> wrote: >> @Jörn: If I understood correctly even later versions of Hive won't be able >> to handle these kinds of workloads? >> >>> On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <jornfra...@gmail.com> wr

Re: Loading Sybase to hive using sqoop

2016-08-24 Thread Jörn Franke
Is your Sybase server ready to deliver a large amount of data? (Network, memory, CPU, parallel access, resources etc.) This is usually the problem when loading data from a relational database, rather than Sqoop/MR or Spark. Then, you should have a recent Hive version and store in ORC or Parquet

Re: HIVE on Windows

2016-08-24 Thread Jörn Franke
it possible to run > Hive on Windows machine ? Thanks > >> On Wednesday, May 18, 2016, Me To <ektapaliwal2...@gmail.com> wrote: >> Thanks so much for replying:) >> >> so without distribution, I will not able to do that? >> >>> On Wed, May 18, 2016

Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables

2016-08-24 Thread Jörn Franke
I think Hive, especially these old versions, has not been designed for this. Why not store them in HBase and run an Oozie job regularly that puts them all into Hive (ORC or Parquet) as a bulk job? > On 24 Aug 2016, at 09:35, Joel Victor wrote: > > Currently I am using

Re: hive concurrency not working

2016-08-06 Thread Jörn Franke
>> On 5 August 2016 at 09:01, Jörn Franke <jornfra...@gmail.com> wrote: >> That is not correct; the option is there to install it. >> >>> On 05 Aug 2016, at 08:41, Mich

Re: hive concurrency not working

2016-08-05 Thread Jörn Franke
Depends on how you configured scheduling in yarn ... > On 05 Aug 2016, at 08:39, Mich Talebzadeh wrote: > > you won't have this problem if you use Spark as the execution engine? That > handles concurrency OK > > Dr Mich Talebzadeh > > LinkedIn >

Re: hive concurrency not working

2016-08-05 Thread Jörn Franke
That is not correct; the option is there to install it. > On 05 Aug 2016, at 08:41, Mich Talebzadeh wrote: > > You won't have this problem if you use Spark as the execution engine! This > set up handles concurrency but Hive with Spark is not part of the HW distro. >

Re: Vectorised Query Execution extension

2016-08-04 Thread Jörn Franke
Even if it is possible, it only makes sense up to a certain limit given by your CPU and CPU caches. > On 04 Aug 2016, at 22:57, Mich Talebzadeh wrote: > > As I understand from the manual: > > Vectorized query execution is a Hive feature that greatly reduces the CPU
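For reference, vectorization is switched on per session like this (in older versions it primarily benefits ORC input):

    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;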

Re: hive concurrency not working

2016-08-03 Thread Jörn Franke
You need to configure the YARN scheduler (fair or capacity, depending on your needs). > On 03 Aug 2016, at 15:14, Raj hadoop wrote: > > Dear All, > > In need of your help, > > we have a Hortonworks 4 node cluster, and the problem is hive is allowing only > one user at a

Re: Doubt on Hive Partitioning.

2016-08-02 Thread Jörn Franke
Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian <qiuzhuang.l...@gmail.com> >> wrote: >> Is this partition pruning fixed in MR too except for TEZ in newer hive >> version? >> >> Regards, >> Q >> >>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Frank

Re: Doubt on Hive Partitioning.

2016-08-02 Thread Jörn Franke
I do not think so, but never tested it. > On 02 Aug 2016, at 03:45, Qiuzhuang Lian <qiuzhuang.l...@gmail.com> wrote: > > Is this partition pruning fixed in MR too except for TEZ in newer hive > version? > > Regards, > Q > >> On Mon, Aug 1, 2016 at 8:48 P

Re: Doubt on Hive Partitioning.

2016-08-01 Thread Jörn Franke
It happens in old Hive versions if the filter is only in the WHERE clause and NOT in the join clause. This should not happen in newer Hive versions. You can check it by executing an EXPLAIN DEPENDENCY query. > On 01 Aug 2016, at 11:07, Abhishek Dubey wrote: > > Hi All,
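For example (table and partition names are illustrative; the output lists the input tables and partitions actually read):

    EXPLAIN DEPENDENCY
    SELECT t1.x FROM t1 JOIN t2 ON (t1.dt = t2.dt) WHERE t1.dt = '2016-08-01';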

Re: Hive External Storage Handlers

2016-07-19 Thread Jörn Franke

Re: hive external table on gzip

2016-07-19 Thread Jörn Franke
Gzip is handled transparently by Hive (by the formats available in Hive; if it is a custom format, it depends on that format). Whatever format the table is (CSV? JSON?), you simply choose the corresponding SerDe and it transparently does the decompression. Keep in mind that gzip is not
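A sketch for gzipped CSV files (layout and path are illustrative; .gz files under the location are decompressed transparently):

    CREATE EXTERNAL TABLE raw_csv (col1 STRING, col2 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/raw/';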

Re: Hive External Storage Handlers

2016-07-18 Thread Jörn Franke
Do not use a self-compiled Hive or Spark version, but only the ones supplied by distributions (Cloudera, Hortonworks, Bigtop...). You will face performance problems, strange errors etc. when building and testing your code using self-compiled versions. If you use the Hive APIs then the engine

Re: Hive on TEZ + LLAP

2016-07-15 Thread Jörn Franke
I would recommend a distribution such as Hortonworks where everything is already configured. As far as I know LLAP is currently not part of any distribution. > On 15 Jul 2016, at 17:04, Ashok Kumar wrote: > > Hi, > > Has anyone managed to make Hive work with Tez + LLAP as

Re: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-14 Thread Jörn Franke
> e.g. > > hive> select java_method ('java.lang.Math','min',45,9) ; > 9 > > I’m not sure how it serves our purpose. > > Dudu > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: Thursday, July 14, 2016 8:55 AM > To: user@hive.apache.org > Subjec

Re: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-13 Thread Jörn Franke
You can use any Java function in Hive without (!) the need to wrap it in a UDF, via the reflect command. However, I am not sure if this meets your use case. Sent from my iPhone > On 13 Jul 2016, at 19:50, Markovitz, Dudu wrote: > > Hi > > I’m personally not aware of
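For example, mirroring the java_method example quoted later in this thread (reflect and java_method are synonyms, and work with static methods):

    SELECT reflect('java.lang.Math', 'min', 45, 9);  -- returns 9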

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Jörn Franke
I think the comparison with the Oracle RDBMS and Oracle TimesTen is not so good. There are times when the in-memory database of Oracle is slower than the RDBMS (especially in the case of Exadata), due to the fact that in-memory - as in Spark - means everything is in memory and everything is always

Re: Trouble trying to get started with hive

2016-07-11 Thread Jörn Franke
Please use a Hadoop distribution to avoid these configuration issues (in the beginning). > On 05 Jul 2016, at 12:06, Kari Pahula wrote: > > Hi. I'm trying to familiarize myself with Hadoop and various projects related > to it. > > I've been following >

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
+ LLAP => DAG + in-memory caching >> >> OK it is another way getting the same result. However, my concerns: >> >> Spark has a wide user base. I judge this from Spark user group traffic >> TEZ user group has no traffic I am afraid >> LLAP I don't know

Re: Optimize Hive Query

2016-06-23 Thread Jörn Franke
The query looks a little bit too complex for what it is supposed to do. Can you reformulate it and restrict the data in a WHERE clause (most selective restriction first)? Another hint would be to use the ORC format (with indexes and optionally bloom filters) with snappy compression, as well as sorting

Re: if else condition in hive

2016-06-21 Thread Jörn Franke
I recommend rethinking it as part of a bulk transfer, potentially even using separate partitions. It will be much faster. > On 21 Jun 2016, at 13:22, raj hive wrote: > > Hi friends, > > INSERT, UPDATE, DELETE commands are working fine in my Hive environment after >

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-21 Thread Jörn Franke
> interface to Hive. Are you saying that the reference command line interface > is not efficiently implemented? :) > > -David Nies > >> On 20.06.2016 at 17:46, Jörn Franke <jornfra...@gmail.com> wrote: >> >> Aside from this the low network performanc

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Jörn Franke
Aside from this, the low network performance could also stem from the Java application receiving the JDBC stream (not threaded / not efficiently implemented etc.). However, that being said, do not use JDBC for this. > On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Jörn Franke
Hello. For no database (including traditional ones) is it advisable to fetch this amount of data through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes). It is a highly questionable approach from a reliability point of view. Export it as a file to HDFS and
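A sketch of the file-based export (path and table name are illustrative):

    INSERT OVERWRITE DIRECTORY '/tmp/query_export'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM big_table;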

Re: Hive indexes without improvement of performance

2016-06-16 Thread Jörn Franke
The indexes are based on the HDFS block size, which is usually around 128 MB. This means that to hit a single row you must always load the full block. In traditional databases the block size is much smaller, which makes index lookups much faster. If the optimizer does not pick up the index then you can query the index directly (it is

Re: insert query in hive

2016-06-08 Thread Jörn Franke
This is not the recommended way to load large data volumes into Hive. Check the external table feature, Sqoop, and the ORC/Parquet formats. > On 08 Jun 2016, at 14:03, raj hive wrote: > > Hi Friends, > > I have to insert the data into hive table from Java program. Insert
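For example, loading files that already sit in HDFS instead of doing row-by-row inserts (path and table name are illustrative):

    LOAD DATA INPATH '/landing/events.csv' INTO TABLE events;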

Re: Convert date in string format to timestamp in table definition

2016-06-05 Thread Jörn Franke
Never use string when you can use int - the performance will be much better - especially for tables in ORC/Parquet format. > On 04 Jun 2016, at 22:31, Igor Kravzov wrote: > > Thanks Dudu. > So if I need actual date I will use view. > Regarding partition column: I

Re: Internode Encryption with HiveServer2

2016-06-03 Thread Jörn Franke
This can be configured on the Hadoop level. > On 03 Jun 2016, at 10:59, Nick Corbett wrote: > > Hi > > > I am deploying Hive in a regulated environment - all data needs to be > encrypted when transferred and at rest. > > > If I run a 'select' statement, using

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Jörn Franke
Thanks very interesting explanation. Looking forward to test it. > On 31 May 2016, at 07:51, Gopal Vijayaraghavan wrote: > > >> That being said all systems are evolving. Hive supports tez+llap which >> is basically the in-memory support. > > There is a big difference

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
blem is that the TEZ user group is exceptionally >>> quiet. Just sent an email to Hive user group to see anyone has managed to >>> built a vendor independent version. >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
divided on this (use Hive with TEZ) or use Impala instead of Hive > etc as I am sure you already know. > > Cheers, > > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >

Re: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Jörn Franke
Both are outdated versions; usually one can support you better if you upgrade to the newest. A firewall could be an issue here. > On 26 May 2016, at 10:11, Nikolay Voronchikhin > wrote: > > Hi PySpark users, > > We need to be able to run large Hive queries in PySpark

Re: Copying all Hive tables from Prod to UAT

2016-05-26 Thread Jörn Franke
Or use Falcon... The Spark JDBC route I would try to avoid. JDBC is not designed for these big data bulk operations, e.g. data has to be transferred uncompressed and there is the serialization/deserialization chain: query result -> protocol -> Java objects -> writing to the specific storage format, etc.

Re: Hive and XML

2016-05-22 Thread Jörn Franke
XML is generally slow in any software. It is not recommended for large data volumes. > On 22 May 2016, at 10:15, Maciek wrote: > > Have you had to load XML data into Hive? Did you run into any problems or > experienced any pain points, e.g. complex schemas or performance? >

Re: HIVE on Windows

2016-05-18 Thread Jörn Franke
Use a distribution, such as Hortonworks. > On 18 May 2016, at 19:09, Me To wrote: > > Hello, > > I want to install Hive on my Windows machine but I am unable to find any > resource out there. I have been trying to set it up for a month but have been unable to > accomplish that.

Re: Query Failing while querying on ORC Format

2016-05-17 Thread Jörn Franke
I do not remember exactly, but I think it worked simply by adding a new partition to the old table with the additional columns. > On 17 May 2016, at 15:00, Mich Talebzadeh wrote: > > Hi Mahendar, > > That version 1.2 is reasonable. > > One alternative is to create
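The related DDL for adding columns to an existing partitioned table (names are illustrative; CASCADE also updates existing partition metadata in Hive 1.1+):

    ALTER TABLE sales ADD COLUMNS (discount DOUBLE) CASCADE;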

Re: Performance for hive external to hbase with serval terabyte or more data

2016-05-11 Thread Jörn Franke
Why don't you export the data from HBase to Hive, e.g. in ORC format? You should not use MR with Hive, but Tez. Also use a recent Hive version (at least 1.2). You can then do the queries there. For large log file processing in real time, one alternative depending on your needs could be Solr on

Re: Making sqoop import use Spark engine as opposed to MapReduce for Hive

2016-04-30 Thread Jörn Franke
I do not think you make it faster by setting the execution engine to Spark, especially with such an old Spark version. For such simple things as "dump" bulk imports and exports, it matters much less, if at all, which execution engine you use. There was recently a discussion on that on the

Re: Container out of memory: ORC format with many dynamic partitions

2016-04-30 Thread Jörn Franke
I would still need some time to dig deeper into this. Are you using a specific distribution? Would it be possible to upgrade to a more recent Hive version? However, having so many small partitions is a bad practice which seriously affects performance. Each partition should at least contain

Analyzing Bitcoin blockchain data with Hive

2016-04-29 Thread Jörn Franke
Dear all, I prepared a small Serde to analyze Bitcoin blockchain data with Hive: https://snippetessay.wordpress.com/2016/04/28/hive-bitcoin-analytics-on-blockchain-data-with-sql/ There are some example queries, but I will add some in the future. Additionally, more unit tests will be added. Let

Re: Sqoop_Sql_blob_types

2016-04-27 Thread Jörn Franke
You could try binary. Is it just for storing the blobs or for doing analyses on them? In the first case you may think about storing them as files in HDFS and including in Hive just a string containing the file name (to make analysis on the other data faster). In the latter case you should
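A sketch of the two options (names are illustrative):

    -- store the blob itself
    CREATE TABLE docs (doc_id STRING, content BINARY);
    -- or store only a reference to a file in HDFS
    CREATE TABLE doc_refs (doc_id STRING, hdfs_path STRING);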

Re: Hive external indexes incorporation into Hive CBO

2016-04-21 Thread Jörn Franke
I am still not sure why you think they are not used. The main issue is that the block size is usually very large (eg 256 MB compared to kilobytes / sometimes few megabytes in traditional databases) and the indexes refer to blocks. This makes it less likely that you can leverage it for small

Re: Hive footprint

2016-04-20 Thread Jörn Franke
Hive has working indexes. However, many people overlook that a block is usually much larger than in a relational database and thus do not use them correctly. > On 19 Apr 2016, at 09:31, Mich Talebzadeh wrote: > > The issue is that Hive has indexes (not index store) but

Re: Hive footprint

2016-04-20 Thread Jörn Franke
It really depends on what you want to do. Hive is more for queries involving a lot of data, whereby HBase+Phoenix is more for OLTP scenarios or sensor ingestion. I think the reason is that Hive has been the entry point for many engines and formats. Additionally there are a lot of tuning capabilities

Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Jörn Franke
You could also explore the in-memory database of 12c. However, I am not sure how beneficial it is for OLTP scenarios. I am excited to see how the performance will be on HBase as a Hive metastore. Nevertheless, your results on Oracle/SSD will be beneficial for the community. > On 17 Apr 2016,
