Why don’t you write them directly on local storage and then write them all to
HDFS?
Then you can create an external table in Hive on them and do analyses
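A sketch of the external-table step, with hypothetical paths and column names:

```sql
-- Assumes the files were written to this HDFS directory first
CREATE EXTERNAL TABLE events (
  event_time STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs:///data/events/';
```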
> On 20 Mar 2020 at 08:30, "wangl...@geekplus.com.cn" wrote:
>
>
>
You could just move the ints outside the Map.
Alternatively you can convert the string to int: cast(strcolumn as int)
See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
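For illustration (hypothetical column and table names):

```sql
-- CAST returns NULL for values that cannot be parsed as int
SELECT CAST(strcolumn AS INT) FROM mytable;
```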
> On 12 Aug 2019 at 21:41, Anup Tiwari wrote:
>
> Hi All,
>
Do you use the HiveContext in Spark? Do you configure the same options there?
Can you share some code?
> On 7 Aug 2019 at 08:50, Rishikesh Gawade wrote:
>
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0.
> Even if i use parquet files the result would be same, because after all
> sparkSQL
You have to create a new table with this column as varchar and do a select
insert from the old table.
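A minimal sketch of that select-insert migration, with hypothetical table and column names:

```sql
-- New table with the column declared as varchar
CREATE TABLE mytable_new (id INT, name VARCHAR(255));

-- Copy the data over; Hive converts the values on insert
INSERT INTO TABLE mytable_new
SELECT id, name FROM mytable;
```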
> On 18 Jul 2019 at 01:14, William Shen wrote:
>
> Hi all,
>
> I assumed that it should be compatible to convert column type varchar to
> string, however, after running ALTER TABLE table
Which Hive version and engine?
If it is Tez, you can also try MR as an engine (set hive.execution.engine=mr);
that will use less memory. Also check the max heap space configuration on the
nodes. Maybe you have 16 GB of physical memory but the Java process gets only
4 GB or so.
Maybe
Can you please provide us more details:
the number of rows in each table and per partition, the table structure, Hive
version, table format, and whether the table is sorted or partitioned on dt?
Why don't you use a join, potentially with a mapjoin hint?
> On 19 Dec 2018 at 09:02, Prabhakar Reddy wrote:
>
>
done on it in a long time. A
> simple UDF isn't capable of providing true unique sequence support.
>
> Thanks
> Shawn
>
> -Original Message-
> From: Jörn Franke
> Sent: Saturday, September 15, 2018 6:09 AM
> To: user@hive.apache.org
> Subject: Re: [featur
If you really need it then you can write a UDF for it.
> On 15. Sep 2018, at 11:54, Nicolas Paris wrote:
>
> Hi
>
> Hive does not provide auto-increment columns (=sequences). Is there any
> chance that feature will be provided in the future ?
>
> This is one of the highest limitation in
You can partition it and only compute statistics for new partitions...
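A sketch, assuming a hypothetical table partitioned on dt:

```sql
-- Compute statistics only for the partition that was just loaded,
-- instead of rescanning the whole table in S3
ANALYZE TABLE mytable PARTITION (dt = '2018-08-26') COMPUTE STATISTICS;
```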
> On 26. Aug 2018, at 12:43, Prabhakar Reddy wrote:
>
> Hello,
>
> Are there any properties that I can set to improve the performance of Analyze
> table compute statistics statement.My data sits in s3 and I see it's taking
No, Parquet and ORC have internal compression, which should be used instead of
the external compression that you are referring to.
Internal compression can be decompressed in parallel, which is significantly
faster. Internally Parquet supports only Snappy, gzip, LZO, Brotli (2.4), LZ4
(2.4), and Zstd (2.4).
Hadoop 3.0 brings some interesting benefits anyway, such as reduced storage
needs (you no longer need to replicate 3 times for reliability reasons), so
that may be convincing.
> On 22. Jul 2018, at 08:28, 彭鱼宴 <461292...@qq.com> wrote:
>
> Hi Tanvi,
>
> Thanks! I will check that and have a
It would be good if you could document this on the Hive wiki so that other
users know about it.
On the other hand there is Apache Bigtop, which tests the integration of various
Big Data components - but it is complicated. Behind a big data distribution
there is a lot of effort.
> On 3. Jul 2018, at
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>> On 9
>>> On 8 June 2018 at 21:56, Jörn Franke wrote:
>>> Oha i see now Serde is a deprecated Interface , if i am not wro
A-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.properties
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at you
Can you get the log files and start Hive with more detailed logging?
It could be that not all libraries are loaded (I don't remember anymore, but I
think this one needs more; I can look in my docs next week) or that it does
not support maps (not sure).
You can try first with a simpler
d point me in
> the direction of anything I’ve missed.
>
> Thanks,
>
> Elliot.
>
>> On Sun, 13 May 2018 at 15:42, Jörn Franke <jornfra...@gmail.com> wrote:
>> In detail you can check the source code, but a Serde needs to translate an
>> object to a Hiv
May 2018, at 17:08, 侯宗田 <zongtian...@icloud.com> wrote:
>
> Thank you, it makes the concept clearer to me. I think I need to look up the
> source code for some details.
>> On 13 May 2018 at 22:42, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> In detail you ca
In detail you can check the source code, but a serde needs to translate an
object to a Hive object and vice versa. Usually this is very simple (simply
passing the object through or creating a HiveDecimal etc). It also provides an
ObjectInspector that basically describes an object in more detail (eg to be
Check the Json serde:
https://cwiki.apache.org/confluence/display/Hive/SerDe
> On 22. Apr 2018, at 09:09, Mahender Sarangam
> wrote:
>
> Hi,
>
> we have to read Gz compressed JSON File from Source System. I see they are 3
> different ways of reading JSON data.
, I was thinking to give a custom UI.
>
> Next read from UI data and build UDFs using the rules defined outside the UDF.
>
> 1 UDF per data object.
>
> Not sure these are just thoughts.
>
>> On Mon, Apr 16, 2018 at 1:40 PM Jörn Franke <jornfra...@gmail.com> wrote
I would not use Drools with Spark; it does not scale to the distributed setting.
You could translate the rules to Hive queries, but this would not be exactly the
same thing.
> On 16. Apr 2018, at 17:59, Joel D wrote:
>
> Hi,
>
> Any suggestions on how to implement
HDFS support depends on the version. For a long time it was not supported.
> On 23. Feb 2018, at 21:08, Andy Srine wrote:
>
> Team,
>
> Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;)
> supported in hiveserver2 via an ODBC connection?
>
> Some
Add jar works only with local files on the Hive server.
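For illustration (hypothetical local path on the HiveServer2 host):

```sql
-- The jar must sit on the local filesystem of the Hive server, not on HDFS
ADD JAR /opt/hive_jars/hive-contrib-2.1.1.jar;
```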
> On 23. Feb 2018, at 21:08, Andy Srine wrote:
>
> Team,
>
> Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;)
> supported in hiveserver2 via an ODBC connection?
>
> Some relevant points:
>
Are you looking for something like this:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
To answer your original question: why not implement the whole job in Hive? Or
orchestrate with Oozie, running some parts in MR and some in Hive.
> On 30. Jan 2018,
Drop the old Parquet table first and then create it with explicit statements.
The above statement keeps using the old Parquet table if it existed.
> On 26. Jan 2018, at 17:35, Brandon Cooke wrote:
>
> Hi Prasad,
>
> I actually have tried this and I had that same
How do you import the data? Bulk import?
What about using partitions (or is the data too small for daily partitions?)
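A sketch of daily partitioning, with hypothetical names:

```sql
CREATE TABLE events (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- The daily import job then writes into its own partition
INSERT OVERWRITE TABLE events PARTITION (dt = '2018-01-02')
SELECT id, payload FROM staging_events;
```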
> On 2. Jan 2018, at 04:59, l...@china-inv.cn wrote:
>
> Hi, All,
>
> We are using Hive to persist our data and we run cron jobs to import new data
> into Hive daily.
>
> At
I also recommend it; you will also get performance improvements with JDK8 in
general (use the latest version).
Keep in mind also that more and more big data libraries etc. will drop JDK7
support soon (aside from the fact that JDK7 is not maintained anymore anyway).
> On 29. Nov 2017, at 01:31, Johannes
orization. So not sure how hive
> storage base authorization will provided additional security. Definitely I am
> missing something.
>
> Please suggest.
>
> Thanks,
> Vijay
>
>> On Thu, Nov 9, 2017 at 1:55 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
Then you need to kerberize it to support what you want
> On 9. Nov 2017, at 09:18, Vijay Toshniwal <vijay.toshni...@gmail.com> wrote:
>
> No its not.
>
> Thanks,
> Vijay
>
>> On Thu, Nov 9, 2017 at 1:09 PM, Jörn Franke <jornfra...@gmail.com> wrot
Is your Hadoop cluster kerberized?
> On 9. Nov 2017, at 06:57, Vijay Toshniwal wrote:
>
> Hi Team,
>
>
>
> I am facing issues while configuring hive storage based authorization. I
> followed the steps mentioned in
>
Yes, it is better to push columns into the view so that users can filter on them.
Alternatively you can use HPL/SQL on Hive.
I think there are very few (if any) use cases that require parameterized views.
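A sketch of pushing the column into the view, with hypothetical names:

```sql
CREATE VIEW sales_v AS
SELECT region, amount FROM sales;

-- Users filter on the exposed column instead of passing a parameter
SELECT SUM(amount) FROM sales_v WHERE region = 'EMEA';
```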
> On 2. Oct 2017, at 16:12, Elliot West wrote:
>
> Hello,
>
> Does any
You should try with Tez+LLAP.
Additionally you will need to compare different configurations.
Finally, an arbitrary comparison is meaningless:
you should use the queries, data and file formats that your users will use later.
> On 2. Oct 2017, at 03:06, Stephen Sprague wrote:
>
>
insert into dep_av values(8,null) should do what you intend.
> On 24. Sep 2017, at 03:03, BD wrote:
>
> Hi ,
>
> I have imported (using sqoop) departments table from retail_db in hdfs as
> avro file. Have created an external table stored as hive and used the avro
>
Test it, because it really depends on what you do. Since you use Hue you seem to
be interested in interactive analysis, so the best is to use Tez and LLAP as the
Hive engine. Also make sure that you use ORC or Parquet as the Hive storage
format. Leverage the built-in ORC or Parquet indexes by sorting
Why do you want to do single inserts?
Hive has been designed more for bulk loads.
In any case, newer versions of Hive 2 using Tez+LLAP improve this significantly
(also for bulk analysis). Nevertheless, it is good practice not to use single
inserts in an analysis system, but to try to combine and
Parquet also has internal indexes, so no need for a Hive index there.
For fast ad-hoc queries you can use Tez+LLAP. Here you could use Parquet or
easily convert to ORC via CTAS. However, you need to check whether ORC is faster
than Parquet depending on your data, queries and configuration (bloom
Depends on your definition of analytical and storage tool. I think Hive
(especially with Tez+LLAP) would qualify as an analytical tool, especially
because you can extend it with SQL procedures, can use any Java function
directly in SQL, and it has a wide range of analytical functions etc.
The
I do not think it is supported. The jar for Hive must be on a local filesystem
of the Hive server (not necessarily on all nodes).
> On 12. Apr 2017, at 16:57, Mahdi Mohammadinasab wrote:
>
> Hello,
>
> I am trying to add a JAR file which is located on HDFS to be later used
>> On 21 February 2017 at 10:26, Jörn Franke <jo
Hello,
I have not tried it, but Sqoop supports any JDBC driver. However, since SQL
syntax is not necessarily standardized, you may face issues or performance
problems. Hive itself has a nice import and export tool that also supports
metadata import/export. It can be orchestrated from
You need to install the Hortonworks ODBC drivers (just Google them) on Windows.
Tableau does not include any drivers, but only Software to use these drivers to
connect to whatever database you need.
> On 22 Jan 2017, at 03:15, Raymond Xie wrote:
>
> Hello,
>
> I have
Sorry, never mind my previous mail... in the stack it seems to look exactly for
this file. Can you try to download the file? Can you check if these are all the
files needed? I think you need to extract the .tar.gz and point to the jars
(check the Tez web site for the config).
> On 17 Jan 2017, at
Maybe the wrong configuration file is picked up?
> On 17 Jan 2017, at 07:44, wenxing zheng wrote:
>
> Dear all,
>
> I met an issue in the TEZ configuration for HIVE, as from the HIVE logs file:
>
>> Caused by: java.io.FileNotFoundException: File does not exist:
>>
gt;
> We take the files every day once so if I put them in textfile and then to ORC
> it will take me almost half a day just to display the data.
>
> It is basicly a time consuming task, and want to do it much quicker. A better
> solution of course would be to put smaller files with F
How large is the file? Might IO be an issue? How many disks does the
only node have?
Do you compress the ORC (Snappy?)?
What is the Hadoop distribution? Configuration baseline? Hive version?
Not sure if I understood your setup, but might the network be an issue?
> On 9 Dec 2016, at 02:08,
Is it a permission issue on the folder?
> On 15 Nov 2016, at 06:28, Stephen Sprague wrote:
>
> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive and
> see what would happen.
>
> running this query:
>
> insert overwrite table
I think the main gain is more about getting rid of a dedicated database
including maintenance and potential license cost.
For really large clusters and a lot of users this might be even more
beneficial. You can avoid clustering the database etc.
> On 24 Oct 2016, at 00:46, Mich Talebzadeh
You need to configure queues in YARN and use the fair scheduler. From your use
case it looks like you also need to configure preemption.
> On 28 Sep 2016, at 00:52, Jose Rozanec wrote:
>
> Hi,
>
> We have a Hive cluster. We notice that some queries consume all
I think what you propose makes sense. If you did a delta load you would not gain
much performance (most likely you would get worse performance, because
you need to figure out what has changed, and you get the typical issues of
distributed systems that some changes may arrive later, error
Increase the timeout, or have the result of the query written to a dedicated table.
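The second option can be sketched as (hypothetical names):

```sql
-- Persist the result so the JDBC session does not have to stay open for an hour
CREATE TABLE query_result AS
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;
```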
> On 20 Sep 2016, at 16:57, anup ahire wrote:
>
>
>
>
> Hello,
>
> I am using hive-jdbc-1.2.1 to run a query. Query runs around an hour and
> eventually completes.
> But my hive session
If you are using a distribution (which you should if you go to production -
Apache releases should not be used, due to the maintainability, complexity and
interaction with other components such as Hadoop etc.) then wait until a
distribution with 2.x is out. As far as I am aware there is
They should be rather similar; you may gain some performance using Tez or Spark
as the execution engine, but in an export scenario do not expect much performance
improvement.
In any scenario avoid having only one reducer; use several, e.g. by
exporting to multiple output files instead
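A sketch (hypothetical table; the reducer-count setting assumes the MR engine):

```sql
SET mapreduce.job.reduces=8;

-- DISTRIBUTE BY forces a reduce stage, so the export is written
-- as several files in parallel rather than one
INSERT OVERWRITE DIRECTORY '/tmp/export/orders'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM orders DISTRIBUTE BY customer_id;
```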
6 3:07 AM, "Joel Victor" <joelsvic...@gmail.com> wrote:
>> @Jörn: If I understood correctly even later versions of Hive won't be able
>> to handle these kinds of workloads?
>>
>>> On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <jornfra...@gmail.com> wr
Is your Sybase server ready to deliver a large amount of data? (Network,
memory, CPU, parallel access, resources etc.) When loading data from a
relational database, the bottleneck is usually there rather than in Sqoop/MR or Spark.
Also, you should have a recent Hive version and store in ORC or Parquet
it possible to run
> Hive on Windows machine ? Thanks
>
>> On Wednesday, May 18, 2016, Me To <ektapaliwal2...@gmail.com> wrote:
>> Thanks so much for replying:)
>>
>> so without distribution, I will not able to do that?
>>
>>> On Wed, May 18, 2016
I think Hive, especially these old versions, has not been designed for this. Why
not store them in HBase and regularly run an Oozie job that puts them all into
Hive/ORC or Parquet as a bulk job?
> On 24 Aug 2016, at 09:35, Joel Victor wrote:
>
> Currently I am using
>> On 5 August 2016 at 09:01, Jörn Franke <jornfra...@gmail.com> wrote:
>> That is not correct the option is there to install it.
>>
>>> On 05 Aug 2016, at 08:41, Mich
Depends on how you configured scheduling in yarn ...
> On 05 Aug 2016, at 08:39, Mich Talebzadeh wrote:
>
> you won't have this problem if you use Spark as the execution engine? That
> handles concurrency OK
>
> Dr Mich Talebzadeh
>
> LinkedIn
>
That is not correct; the option is there to install it.
> On 05 Aug 2016, at 08:41, Mich Talebzadeh wrote:
>
> You won't have this problem if you use Spark as the execution engine! This
> set up handles concurrency but Hive with Spark is not part of the HW distro.
>
Even if it is possible, it only makes sense up to a certain limit given by your
CPU and CPU caches.
> On 04 Aug 2016, at 22:57, Mich Talebzadeh wrote:
>
> As I understand from the manual:
>
> Vectorized query execution is a Hive feature that greatly reduces the CPU
You need to configure the yarn scheduler (fair or capacity depending on your
needs)
> On 03 Aug 2016, at 15:14, Raj hadoop wrote:
>
> Dear All,
>
> In need or your help,
>
> we have horton works 4 node cluster,and the problem is hive is allowing only
> one user at a
Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian <qiuzhuang.l...@gmail.com>
>> wrote:
>> Is this partition pruning fixed in MR too except for TEZ in newer hive
>> version?
>>
>> Regards,
>> Q
>>
>>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Frank
I do not think so, but I never tested it.
> On 02 Aug 2016, at 03:45, Qiuzhuang Lian <qiuzhuang.l...@gmail.com> wrote:
>
> Is this partition pruning fixed in MR too except for TEZ in newer hive
> version?
>
> Regards,
> Q
>
>> On Mon, Aug 1, 2016 at 8:48 P
It happens in old Hive versions if the filter is only in the WHERE clause and
NOT in the join clause. This should not happen in newer Hive versions. You can
check it by executing an EXPLAIN DEPENDENCY query.
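A quick way to check, with hypothetical tables a and b partitioned on dt:

```sql
-- EXPLAIN DEPENDENCY lists the tables and partitions the query would actually
-- read, so you can see whether partition pruning kicks in
EXPLAIN DEPENDENCY
SELECT a.*
FROM a JOIN b ON (a.k = b.k AND a.dt = '2016-08-01');
```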
> On 01 Aug 2016, at 11:07, Abhishek Dubey wrote:
>
> Hi All,
Gzip is transparently handled by Hive (by the formats available in Hive; if
it is a custom format it depends on that format). What format is the table (CSV?
JSON?)? Depending on that, you simply choose the corresponding serde and it
transparently does the decompression. Keep in mind that gzip is not
Do not use a self-compiled Hive or Spark version, but only the ones supplied by
distributions (Cloudera, Hortonworks, Bigtop...). You will face performance
problems, strange errors etc. when building and testing your code using
self-compiled versions.
If you use the Hive APIs then the engine
I would recommend a distribution such as Hortonworks, where everything is already
configured. As far as I know LLAP is currently not part of any distribution.
> On 15 Jul 2016, at 17:04, Ashok Kumar wrote:
>
> Hi,
>
> Has anyone managed to make Hive work with Tez + LLAP as
> e.g.
>
> hive> select java_method ('java.lang.Math','min',45,9) ;
> 9
>
> I’m not sure how it serves out purpose.
>
> Dudu
>
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: Thursday, July 14, 2016 8:55 AM
> To: user@hive.apache.org
> Subjec
You can use any Java function in Hive without (!) the need to wrap it in a
UDF, via the reflect command.
However, I am not sure if this meets your use case.
Sent from my iPhone
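For illustration, reflect() calls a static Java method directly (java_method is a synonym for reflect):

```sql
-- No UDF needed: invoke java.lang.Math.min(45, 9) directly
SELECT reflect('java.lang.Math', 'min', 45, 9);
-- returns 9
```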
> On 13 Jul 2016, at 19:50, Markovitz, Dudu wrote:
>
> Hi
>
> I’m personally not aware of
I think the comparison with the Oracle RDBMS and Oracle TimesTen is not so good.
There are times when the in-memory database of Oracle is slower than the RDBMS
(especially in the case of Exadata), due to the issue that in-memory - as in
Spark - means everything is in memory and everything is always
Please use a Hadoop distribution to avoid these configuration issues (in the
beginning).
> On 05 Jul 2016, at 12:06, Kari Pahula wrote:
>
> Hi. I'm trying to familiarize myself with Hadoop and various projects related
> to it.
>
> I've been following
>
+ LLAP => DAG + in-memory caching
>>
>> OK it is another way getting the same result. However, my concerns:
>>
>> Spark has a wide user base. I judge this from Spark user group traffic
>> TEZ user group has no traffic I am afraid
>> LLAP I don't know
The query looks a little too complex for what it is supposed to do. Can
you reformulate it and restrict the data in a WHERE clause (most selective
restriction first)? Another hint would be to use the ORC format (with indexes and
optionally bloom filters) with Snappy compression, as well as sorting
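That hint can be sketched as (hypothetical table and column names):

```sql
CREATE TABLE events_orc (id BIGINT, userid STRING, payload STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',
  'orc.bloom.filter.columns' = 'userid'
);

-- Sorting on the filtered column makes the ORC min/max indexes
-- and bloom filters effective
INSERT OVERWRITE TABLE events_orc
SELECT id, userid, payload FROM events SORT BY userid;
```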
I recommend rethinking it as part of a bulk transfer, potentially even using
separate partitions. It will be much faster.
> On 21 Jun 2016, at 13:22, raj hive wrote:
>
> Hi friends,
>
> INSERT,UPDATE,DELETE commands are working fine in my Hive environment after
>
gt; interface to Hive. Are you saying that the reference command line interface
> is not efficiently implemented? :)
>
> -David Nies
>
>> On 20 Jun 2016 at 17:46, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> Aside from this the low network performanc
Aside from this, the low network performance could also stem from the Java
application receiving the JDBC stream (not threaded / not efficiently
implemented etc.). That being said, do not use JDBC for this.
> On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:
&
Hello,
For no database (including traditional ones) is it advisable to fetch this
amount of data through JDBC. JDBC is not designed for this (neither for import
nor for export of large data volumes). It is a highly questionable approach from
a reliability point of view.
Export it as a file to HDFS and
The indexes are based on the HDFS block size, which is usually around 128 MB.
This means that to hit a single row you must always load the full block. In
traditional databases the block size is much smaller, which makes index lookups
much faster. If the optimizer does not pick up the index then you can query the
index directly (it is
This is not the recommended way to load large data volumes into Hive. Check the
external table feature, Sqoop, and the ORC/Parquet formats.
> On 08 Jun 2016, at 14:03, raj hive wrote:
>
> Hi Friends,
>
> I have to insert the data into hive table from Java program. Insert
Never use string when you can use int; the performance will be much better,
especially for tables in ORC/Parquet format.
> On 04 Jun 2016, at 22:31, Igor Kravzov wrote:
>
> Thanks Dudu.
> So if I need actual date I will use view.
> Regarding partition column: I
This can be configured on the Hadoop level.
> On 03 Jun 2016, at 10:59, Nick Corbett wrote:
>
> Hi
>
>
> I am deploying Hive in a regulated environment - all data needs to be
> encrypted when transferred and at rest.
>
>
> If I run a 'select' statement, using
Thanks, very interesting explanation. Looking forward to testing it.
> On 31 May 2016, at 07:51, Gopal Vijayaraghavan wrote:
>
>
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
>
> There is a big difference
blem is that the TEZ user group is exceptionally
>>> quiet. Just sent an email to Hive user group to see anyone has managed to
>>> built a vendor independent version.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn
>>> https
divided on this (use Hive with TEZ) or use Impala instead of Hive
> etc as I am sure you already know.
>
> Cheers,
>
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
Both have outdated versions; usually you can get better support if you upgrade
to the newest.
A firewall could be an issue here.
> On 26 May 2016, at 10:11, Nikolay Voronchikhin
> wrote:
>
> Hi PySpark users,
>
> We need to be able to run large Hive queries in PySpark
Or use Falcon ...
I would try to avoid the Spark JDBC route. JDBC is not designed for these big
data bulk operations; e.g. data has to be transferred uncompressed, and there is
the serialization/deserialization overhead of query result -> protocol -> Java
objects -> writing to a specific storage format etc.
XML is generally slow in any software. It is not recommended for large data
volumes.
> On 22 May 2016, at 10:15, Maciek wrote:
>
> Have you had to load XML data into Hive? Did you run into any problems or
> experienced any pain points, e.g. complex schemas or performance?
>
Use a distribution, such as Hortonworks
> On 18 May 2016, at 19:09, Me To wrote:
>
> Hello,
>
> I want to install hive on my windows machine but I am unable to find any
> resource out there. I am trying to set up it from one month but unable to
> accomplish that.
I do not remember exactly, but I think it worked simply by adding a new
partition to the old table with the additional columns.
> On 17 May 2016, at 15:00, Mich Talebzadeh wrote:
>
> Hi Mahendar,
>
> That version 1.2 is reasonable.
>
> One alternative is to create
Why don't you export the data from HBase to Hive, e.g. in ORC format? You should
not use MR with Hive, but Tez. Also use a recent Hive version (at least 1.2).
You can then do your queries there. For large log file processing in real time,
one alternative depending on your needs could be Solr on
I do not think you will make it faster by setting the execution engine to Spark,
especially with such an old Spark version.
For simple things such as "dump" bulk imports and exports, it matters
much less, if at all, which execution engine you use.
There was recently a discussion on that on the
I would still need some time to dig deeper into this. Are you using a specific
distribution? Would it be possible to upgrade to a more recent Hive version?
However, having so many small partitions is bad practice and seriously
affects performance. Each partition should contain at least
Dear all,
I prepared a small Serde to analyze Bitcoin blockchain data with Hive:
https://snippetessay.wordpress.com/2016/04/28/hive-bitcoin-analytics-on-blockchain-data-with-sql/
There are some example queries, but I will add some in the future.
Additionally, more unit tests will be added.
Let
You could try it as binary. Is it just for storing the blobs or for doing
analyses on them? In the first case you might think about storing them as files
in HDFS and including in Hive just a string containing the file name (to make
analysis of the other data faster). In the latter case you should
I am still not sure why you think they are not used. The main issue is that the
block size is usually very large (eg 256 MB, compared to kilobytes or sometimes
a few megabytes in traditional databases) and the indexes refer to blocks. This
makes it less likely that you can leverage them for small
Hive has working indexes. However, many people overlook that a block is usually
much larger than in a relational database, and thus do not use them correctly.
> On 19 Apr 2016, at 09:31, Mich Talebzadeh wrote:
>
> The issue is that Hive has indexes (not index store) but
Depends really on what you want to do. Hive is more for queries involving a lot
of data, whereas HBase+Phoenix is more for OLTP scenarios or sensor ingestion.
I think the reason is that Hive has been the entry point for many engines and
formats. Additionally there are a lot of tuning capabilities
You could also explore the in-memory database of 12c. However, I am not sure
how beneficial it is for OLTP scenarios.
I am excited to see how the performance will be with HBase as the Hive metastore.
Nevertheless, your results on Oracle/SSD will be beneficial for the community.
> On 17 Apr 2016,