Clearly this is a flexible way to sidestep an out-of-band second pass or custom JDBC code, but my inquiry was more on philosophical grounds: conceptually I prefer not to mix the output specification into the actual query. Without using INSERT, won't this also have to print something for every row? And doesn't this approach allow records to be exported before the query is known to have fully succeeded, meaning you could have exported only partial results before a subsequent Hive/Hadoop failure, with no rollback oversight or control?
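[For readers of the archive: the in-query export pattern being questioned looks roughly like the sketch below. The function name dboutput, its argument order, the JAR path, the class name, and the connection details are all illustrative of the prototype discussed in this thread, not a committed API.]

  -- Illustrative only: dboutput and its signature are assumptions
  -- based on the prototype discussed in this thread.
  ADD JAR /path/to/dboutput.jar;
  CREATE TEMPORARY FUNCTION dboutput AS 'com.example.hive.udf.DBOutput';

  -- The UDF fires one JDBC INSERT per selected row as a side effect,
  -- and still has to return a value (here, a status code) for every
  -- row. Rows reach MySQL as the job runs, so a later task failure
  -- can leave a partial export behind -- the concern raised above.
  -- Partition columns dt/hr are assumptions.
  SELECT dboutput('jdbc:mysql://dbhost/stats', 'user', 'pass',
                  'INSERT INTO http_status_summary VALUES (?, ?)',
                  http_status, cnt)
  FROM (
    SELECT http_status, count(1) AS cnt
    FROM raw_web_log
    WHERE dt = '2009-07-14' AND hr = '14'
    GROUP BY http_status
  ) t;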
On Jul 14, 2009 3:19 PM, "Edward Capriolo" <[email protected]> wrote:

On Tue, Jul 14, 2009 at 5:40 PM, Frederick Oko <[email protected]> wrote:
> Is this just a work...

Frederick,

A bulk operation would perform better. I was going to build a more complete version and start a JIRA on it; I guess I let the cat out of the bag on this one.

My notes: my dataflow looks like this:

RawWebLogs (5 minute interval) -> Hive raw_web_log, partitioned by date and hour

Every hour I want to summarize the previous hour for fast access. These summaries might be used directly by a web application. For example, I want a summary of http_status and count. This would have from 5-10 rows:

200 65300
304  4000
404  1003

A medium summary like a page count may look like this and have < 50,000 rows:

/page1 9000
/page2 5454

raw -> Large Summaries  > 100,000 rows
raw -> Medium Summaries < 100,000 rows
raw -> Small Summaries  < 1,000 rows

Small summaries:
Might be too small for a partition, but could be justified.
Loading into a relational database should be fast.
Operations like sorting could be handled in a client browser/AJAX.

Medium summaries:
Might be too small for a partition, but could be justified.
Loading into a relational database should be fast.
Operations like sorting and column pruning should be handled server side.

Large summaries:
Should be in a partition.
Loading into a relational database should be throttled.
Operations like sorting or column pruning might need to be done by another summary, or should be pre-sorted.

Currently I am dropping these summaries into a partition of another table; some are very, very small:

raw_web_log_http_status, partitioned by date and hour

The UDF would be targeting small and medium summaries. Performance should not be an issue since the number of inserts should be small. Also, I think if performance is an issue the JDBC driver could use a connection pool. Large inserts would likely have issues that custom grammar or a UDF could not address either, like throttling, replication, and shards.

My goal is to be able to dump my summaries into a Hive partition and/or a MySQL table. This way my front-end web server does not need to have exotic Hadoop/Hive access to show people stats. Do you think I am on the right track?

Edward
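[For concreteness, the hourly rollup into a partition of a summary table that Edward describes might look like the following HiveQL. The table names come from his message; the partition column names dt and hr and the literal values are assumptions.]

  -- Hourly rollup of the previous hour's raw logs into a small
  -- summary partition (5-10 rows of http_status, count).
  INSERT OVERWRITE TABLE raw_web_log_http_status
    PARTITION (dt = '2009-07-14', hr = '14')
  SELECT http_status, count(1)
  FROM raw_web_log
  WHERE dt = '2009-07-14' AND hr = '14'
  GROUP BY http_status;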

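[One way to reconcile Edward's two targets (a Hive partition and a MySQL copy) with Frederick's atomicity concern would be to run the export as a second query over the already-completed summary partition, so nothing reaches MySQL until the rollup has fully succeeded. This is a sketch, not either author's stated design; dboutput is the hypothetical UDF shown earlier, and the MySQL column layout is assumed.]

  -- Export only after the summary partition exists in full. A failure
  -- here can still leave a partial MySQL table, but the Hive side is
  -- already complete, and REPLACE INTO (MySQL-specific) makes a
  -- re-run of the export idempotent.
  SELECT dboutput('jdbc:mysql://dbhost/stats', 'user', 'pass',
                  'REPLACE INTO http_status_summary VALUES (?, ?)',
                  http_status, cnt)
  FROM raw_web_log_http_status
  WHERE dt = '2009-07-14' AND hr = '14';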