[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

2014-10-16 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173455#comment-14173455
 ] 

Gunther Hagleitner commented on HIVE-8467:
--

Materialized views don't necessarily have to keep the tables in sync, do they? 
Other vendors allow deferred refreshes and let the user specify integrity 
levels. I.e., you can still put the onus on the user, and you don't 
necessarily have to offer a background sync method (you can choose to add 
further options later).
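
For illustration, a deferred refresh in Oracle-style syntax looks roughly like 
this (a sketch only - details vary by vendor and version, and t/t_copy are 
placeholder names):

    -- Materialized view that is refreshed only when the user asks for it:
    CREATE MATERIALIZED VIEW t_copy
      REFRESH FORCE ON DEMAND
      ENABLE QUERY REWRITE
      AS SELECT * FROM t;

    -- The user chooses how stale a rewrite is allowed to be:
    ALTER SESSION SET QUERY_REWRITE_INTEGRITY = STALE_TOLERATED;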

As far as other engines go - you have the same problem, right? You can expose 
the table copy or view, but the smarts for how and when to rewrite queries 
have to be built into each of those, or left to the user. With materialized 
views, other engines will also know how the tables are derived, which seems 
beneficial (well, if they speak SQL at least). For Pig and MR you will likely 
have to bake assumptions into the scripts/code.

Could you say more about the retention policy, the max size, and in general 
how you have seen people choose which partitions to add to the table copy? Is 
it typically the newest n partitions? Or the last month of data? That'd be 
interesting - to see if it can be mapped onto materialized views and how hard 
it'd be for the CBO to handle it.


 Table Copy - Background, incremental data load
 --

 Key: HIVE-8467
 URL: https://issues.apache.org/jira/browse/HIVE-8467
 Project: Hive
  Issue Type: New Feature
Reporter: Rajat Venkatesh
 Attachments: Table Copies.pdf


 Traditionally, Hive and other tools in the Hadoop eco-system haven't required 
 a load stage. However, with recent developments, Hive is much more performant 
 when data is stored in specific formats like ORC, Parquet, Avro, etc. 
 Technologies like Presto also work much better with certain data formats. At 
 the same time, data is generated or obtained from 3rd parties in non-optimal 
 formats such as CSV, tab-delimited, or JSON. Many times, it's not an option 
 to change the data format at the source. We've found that users either use 
 sub-optimal formats or spend a large amount of effort creating and 
 maintaining copies. We want to propose a new construct - Table Copy - to help 
 “load” data into an optimal storage format.
 I am going to attach a PDF document with a lot more details, especially 
 addressing how this is different from bulk loads in relational DBs or 
 materialized views.
 Looking forward to hearing if others see a similar need to formalize 
 conversion of data to different storage formats. If yes, are the details in 
 the PDF document a good start?





[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

2014-10-16 Thread Rajat Venkatesh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173534#comment-14173534
 ] 

Rajat Venkatesh commented on HIVE-8467:
---

No, they don't have to. The databases I know provide both options - sync on 
user input or automatically. I am not confident we can support automatic sync 
on external tables. Since it feels like a big feature gap, I chose a different 
name.

Yes - we also have diffs we would like to contribute to other projects so they 
can use Table Copy. Since the optimization is at the storage level, it's very 
simple: replace partitions from the table copy when possible, and directories 
when it comes to Pig or M/R. If materialized views are chosen, then the 
optimizers have to mature in more or less lock step.

WRT retention policy, the common case is to only keep the newest n partitions, 
limited by the size of the copy. We didn't choose a date range, since 
sometimes the date partition is not the top-level one. This is a moving 
window. If older partitions are accessed, queries fall back to reading those 
partitions from the Hive table.
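
As a concrete sketch, one maintenance cycle in plain HiveQL (table, column, 
and partition names here are illustrative only):

    -- Load the newest partition into the copy in an optimized format:
    INSERT OVERWRITE TABLE t_copy PARTITION (ds = '2014-10-16')
      SELECT col1, col2 FROM t WHERE ds = '2014-10-16';

    -- Evict the oldest partition to keep the moving window bounded:
    ALTER TABLE t_copy DROP PARTITION (ds = '2014-09-16');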



[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

2014-10-15 Thread Julian Hyde (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172767#comment-14172767
 ] 

Julian Hyde commented on HIVE-8467:
---

I see this as a particular kind of materialized view. In general, a 
materialized view is a table whose contents are guaranteed to be the same as 
executing a particular query. In this case, that query is simply 'select * from 
t'.

We don't have materialized view support yet, but I have been working on 
lattices in Calcite, formerly known as Optiq (see OPTIQ-344), and there is a 
lot of interest in adding them to Hive. Each materialized tile in a lattice 
is a materialized view of the form 'select d1, d2, sum(m1), count(m2) from t 
group by d1, d2'.

So, let's talk about whether we could change the syntax to 'create materialized 
view' and still deliver the functionality you need. Of course, if the user 
enters anything other than 'select * from t order by k1, k2' they would get an 
error.
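
A hypothetical sketch of that restricted form (the STORED AS clause and the 
names are assumptions, not settled syntax):

    CREATE MATERIALIZED VIEW t_copy
      STORED AS ORC
      AS SELECT * FROM t;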

In terms of query planning, I strongly recommend that you build on the CBO work 
powered by Calcite. Let's suppose there is a table T and a copy C. After 
translating the query to a Calcite RelNode tree, there will be a 
TableAccessRel(T). After reading the metadata, we should create a 
TableAccessRel(C) and tell Calcite that it is equivalent.

That's all you need to do; Calcite will take it from there. Assuming the stats 
indicate that C is better (and they should, right, because the ORC 
representation will be smaller?), the query will end up using C. But if, say, 
T has a partitioning scheme which is more suitable for a particular query, 
then Calcite will choose T.
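
To make the substitution concrete (table and column names hypothetical): if 
the user writes

    SELECT col1, COUNT(*) FROM t GROUP BY col1;

then, once TableAccessRel(C) is registered as equivalent to TableAccessRel(T), 
the planner is free to pick the cheaper plan, as if the user had written

    SELECT col1, COUNT(*) FROM c GROUP BY col1;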



[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

2014-10-15 Thread Rajat Venkatesh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173373#comment-14173373
 ] 

Rajat Venkatesh commented on HIVE-8467:
---

'Guaranteed to be the same' is the real bugbear. WRT managed tables or 
databases, this is a tractable problem. Typically one can augment DML plans to 
keep the materialized views in sync. A mechanism to invalidate views and 
refresh them in the background will also be required.

When it comes to external tables, the situation is a lot more haphazard. Users 
add files, remove files, or rewrite files and expect them to be available when 
they query the table. Also, data can change in partitions that are a few days 
old; for example, some 3rd-party data providers will send corrections after 3 
days. In such a situation, the only way I can think of to guarantee that a 
view is synced is by scanning the directories. It would be great to hear if 
others have a better plan. So I've avoided the term materialized views to put 
the onus on the user to keep copies of external tables in sync. In that sense, 
table copy is complementary to materialized views: use materialized views on 
managed tables and table copies on external tables.

Another factor is that we want to make these copies available to other 
execution engines and languages; in our case those are Presto, Pig, and M/R. 
Use Hive to manage these copies and read them from the other engines as well. 
This also means that we have to cater to the lowest common denominator.

From your description of the CBO, I think it should be relatively 
straightforward to bring in Table Copies. Can Calcite make decisions at the 
partition level too? We would like to handle situations where some partitions 
are not available in the copy.
