[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

2014-10-16 Thread Rajat Venkatesh (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173534#comment-14173534 ]

Rajat Venkatesh commented on HIVE-8467:
---

No, they don't have to. The databases I know of provide both options: sync on 
user input or automatically. I am not confident we can support automatic sync 
on external tables, and since that feels like a big feature gap, I chose a 
different name.

Yes - we also have diffs we would like to contribute to other projects so they 
can use Table Copy. Since the optimization is at the storage level, it's very 
simple: replace partitions with their equivalents from the table copy when 
possible, and replace directories when it comes to Pig or M/R. If materialized 
views are chosen instead, the optimizers of those projects would have to 
mature in more or less lock step.
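
To make the substitution concrete, here is a minimal sketch (my illustration, 
not part of the proposal; the class and method names are hypothetical) of a 
resolver that prefers a partition's location in the table copy and falls back 
to the source partition otherwise. A Pig or M/R job would apply the same 
mapping at the directory level before handing paths to its loader.

    import java.util.Map;
    import java.util.Optional;

    /**
     * Hypothetical resolver illustrating the storage-level substitution:
     * prefer the converted (e.g. ORC) partition from the table copy and
     * fall back to the original partition otherwise.
     */
    public class TableCopyResolver {
        // Maps a source partition location to its location in the copy.
        private final Map<String, String> copiedPartitions;

        public TableCopyResolver(Map<String, String> copiedPartitions) {
            this.copiedPartitions = copiedPartitions;
        }

        /** Returns the path a query (or a Pig/M-R job) should actually read. */
        public String resolve(String sourcePartitionPath) {
            return Optional.ofNullable(copiedPartitions.get(sourcePartitionPath))
                           .orElse(sourcePartitionPath);
        }
    }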

WRT the retention policy, the common case is to keep only the newest n 
partitions, limited by the size of the copy. We didn't choose a date range 
because the date partition is sometimes not the top-level one. This gives a 
moving window: if older partitions are accessed, the query falls back to 
reading those partitions from the Hive table.
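
As an illustration of that moving window (again hypothetical names; real 
partition sizes would come from the file system), the bookkeeping could be as 
simple as:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    /** Keeps only the newest partitions in the copy, bounded by total size. */
    public class MovingWindowRetention {
        private final long maxBytes;
        private final Deque<String> window = new ArrayDeque<>(); // newest first
        private final Map<String, Long> sizes = new HashMap<>();
        private long usedBytes = 0;

        public MovingWindowRetention(long maxBytes) {
            this.maxBytes = maxBytes;
        }

        /** Admits a newly converted partition, evicting the oldest as needed. */
        public void add(String partition, long sizeBytes) {
            window.addFirst(partition);
            sizes.put(partition, sizeBytes);
            usedBytes += sizeBytes;
            while (usedBytes > maxBytes && window.size() > 1) {
                String oldest = window.removeLast();
                usedBytes -= sizes.remove(oldest);
                // Queries touching the evicted partition now fall back to
                // reading it from the Hive table.
            }
        }

        public boolean inCopy(String partition) {
            return sizes.containsKey(partition);
        }
    }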



[jira] [Created] (HIVE-8467) Table Copy - Background, incremental data load

2014-10-15 Thread Rajat Venkatesh (JIRA)
Rajat Venkatesh created HIVE-8467:
-

 Summary: Table Copy - Background, incremental data load
 Key: HIVE-8467
 URL: https://issues.apache.org/jira/browse/HIVE-8467
 Project: Hive
  Issue Type: New Feature
Reporter: Rajat Venkatesh


Traditionally, Hive and other tools in the Hadoop ecosystem haven't required a 
load stage. However, with recent developments, Hive performs much better when 
data is stored in specific formats such as ORC, Parquet, or Avro, and 
technologies like Presto also work much better with certain data formats. At 
the same time, data is generated or obtained from third parties in non-optimal 
formats such as CSV, tab-delimited text, or JSON, and often it is not an 
option to change the data format at the source. We've found that users either 
settle for sub-optimal formats or spend a large amount of effort creating and 
maintaining copies. We want to propose a new construct - Table Copy - to help 
“load” data into an optimal storage format.

I am going to attach a PDF document with a lot more details, especially 
addressing how this differs from bulk loads in relational DBs or materialized 
views.

Looking forward to hearing whether others see a similar need to formalize the 
conversion of data to different storage formats. If yes, are the details in 
the PDF document a good start?





[jira] [Updated] (HIVE-8467) Table Copy - Background, incremental data load

2014-10-15 Thread Rajat Venkatesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajat Venkatesh updated HIVE-8467:
--
Attachment: Table Copies.pdf



[jira] [Commented] (HIVE-8467) Table Copy - Background, incremental data load

2014-10-15 Thread Rajat Venkatesh (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173373#comment-14173373 ]

Rajat Venkatesh commented on HIVE-8467:
---

"Guaranteed to be the same" is the real bugbear. WRT managed tables or 
databases, this is a tractable problem: typically one can augment DML plans to 
keep the materialized views in sync. A mechanism to invalidate views and 
refresh them in the background would also be required.
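
As a sketch of that invalidate-and-refresh mechanism (the hook and class are 
hypothetical, not an existing Hive API):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /**
     * Hypothetical illustration: DML on a source table marks dependent
     * copies stale, and a background task re-converts stale partitions.
     */
    public class CopyRefresher {
        private final Set<String> stale = ConcurrentHashMap.newKeySet();
        private final ScheduledExecutorService pool =
                Executors.newSingleThreadScheduledExecutor();

        /** Called from an augmented DML plan after a partition is written. */
        public void invalidate(String partition) {
            stale.add(partition);
        }

        /** Periodically re-converts whatever has been invalidated. */
        public void start() {
            pool.scheduleWithFixedDelay(() -> {
                for (String p : stale) {
                    reconvert(p);
                    stale.remove(p);
                }
            }, 0, 5, TimeUnit.MINUTES);
        }

        private void reconvert(String partition) {
            // Placeholder: rewrite the partition into the copy's format,
            // e.g. via an INSERT OVERWRITE into the copy.
        }
    }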

When it comes to external tables, the situation is a lot more haphazard. Users 
add, remove, or rewrite files and expect the changes to be visible when they 
query the table. Data can also change in partitions that are a few days old; 
for example, some third-party data providers send corrections after three 
days. In such a situation, the only way I can think of to guarantee that a 
view is in sync is to scan the directories. It would be great to hear if 
others have a better plan. That is why I've avoided the term materialized 
view: it puts the onus on the user to keep copies of external tables in sync. 
In that sense, Table Copy is complementary to materialized views - use 
materialized views on managed tables and table copies on external tables.
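
For illustration, a staleness check built on such a directory scan, using the 
standard Hadoop FileSystem API (the surrounding class is my own sketch), could 
look like:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Sketch: a partition in the copy is stale if any file under the
     * source partition is newer than the time the copy was built.
     */
    public class StalenessScanner {
        private final FileSystem fs;

        public StalenessScanner(FileSystem fs) {
            this.fs = fs;
        }

        /** True if the source partition changed after the copy was built. */
        public boolean isStale(Path sourcePartition, long copyBuiltAtMillis)
                throws IOException {
            for (FileStatus f : fs.listStatus(sourcePartition)) {
                if (f.getModificationTime() > copyBuiltAtMillis) {
                    return true; // an added, rewritten, or corrected file
                }
            }
            return false;
        }
    }

A fuller check would also have to compare file names and counts, since 
deleting a file does not bump the modification time of any remaining file.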

Another factor is that we want to make these copies available to other 
execution engines and languages - in our case Presto, Pig, and M/R - using 
Hive to manage the copies while the others read them as well. This also means 
that we have to cater to the lowest common denominator.

From your description of CBO, I think it should be relatively straightforward 
to bring in Table Copies. Can Calcite make decisions at the partition level 
too? We would like to handle situations where some partitions are not 
available in the copy.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)