[Hadoop Wiki] Update of "Hive/Roadmap" by Ning Zhang

Apache Wiki Wed, 18 May 2011 10:56:37 -0700

The "Hive/Roadmap" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Roadmap?action=diff&rev1=41&rev2=42

--------------------------------------------------

   * [[https://issues.apache.org/jira/browse/HIVE-1538|Remove Duplicate 
Filters]]
   * [[https://issues.apache.org/jira/browse/HIVE-1644|Use Filter Pushdown for 
Automatically Accessing Indexes]]
   * [[https://issues.apache.org/jira/browse/HIVE-1803|Bitmap Index]]
-  * [[http://businesssoftwaresolution.org/contract-management-software/| 
Contract Management Software]]
   * [[https://issues.apache.org/jira/browse/HIVE-1790|HAVING clause support]]
-  * [[http://www.zetacleardetails.com| ZetaClear]]
   * [[https://issues.apache.org/jira/browse/HIVE-1517|Cross-database queries]]
  
  == Up For Grabs ==
  
-  * View improvements
+ Priorities are denoted as P0 > P1 ...
+ 
+ === Query Optimization ===
+  * [P0] Optimizing JOIN followed by GROUP BY
+   * A lot of analytics queries are JOINs followed by GROUP BY (join keys and 
group by keys may or may not be the same or related). We need a better 
optimization for this kind of query (optimize number of MapReduce Jobs vs. 
optimize data transfer size etc.)
+  * [P0] Optimize JOINs using Bloom Filters
+   * This is to optimize the case where two big tables are joined but the 
results are small. 
-  * Column-level statistics
+  * [P1] Column-level statistics
+   * We already have UDAFs for percentile, histogram etc. We need to figure 
out a smart way to compute column-level (approximate) stats in a streaming 
fashion (during/piggy-backing the scan/loading data process) without firing a 
new query. 
+  * [P1] Determining whether to use skew for a count distinct in group by
+   * Need column level stats or a sample task before the real query.
+  * [P1] Exploit clustering information in the input data
+   * Some data are generated using 'GROUP BY', 'CLUSTER BY', 'ORDER BY' etc., 
which imply an ordering in the data. We should exploit this metadata to make 
better choices among join strategies (e.g., map-side sort-merge join).
+  * Cost-based Optimization
+  
+ === Query Execution ===
+  * [P0] INSERT INTO statement
+   * Without INSERT INTO, users can only do INSERT OVERWRITE, and this forces 
them to create a large number of partitions. INSERT INTO could help reduce 
the number of partitions.
+  * [P0] Block-level merge
+   * After the INSERT statement finishes, we check the average size of the 
output files. If it is smaller than a threshold, we launch another MR job to 
merge the small files into one. This MR job could be slow. We need to use 
block-level merge for RCFile to significantly speed up the merge process.
+  * [P0] Native Support for types of numbers, datetime, and IP addresses
+   * Currently most numbers, datetimes, and IP addresses are treated as 
strings. If we know their data types we should store the values in a more 
efficient manner (see also "Binary Storage and SerDe").
+  * [P0] Create a URL data type that is more friendly to data compression.
+   * URLs are very long and can contribute to a large portion of storage. They 
are treated as generic strings and use string compression algorithms. We should 
investigate whether there are ways to treat URL data types in a clever way so 
that the compression algorithm (maybe a customized compression algorithm) can 
compress the data better with small enough overhead. 
+  * [P1] Binary storage format and SerDe
+   * The idea is to store data in their native format (rather than converting 
to UTF-8 string type) before they are stored on disk. This can potentially save 
a lot of CPU/IO cost in data conversion and object creation.
+   * Also investigate the compression techniques used in other column-stores 
that do not require decompression before query (filtering). 
+   * Revisit the LazyBinarySerDe and see if it can be reused or extended to 
this storage format.
+   * Make this storage format amenable to mmap() (or FileChannel in Java) so 
that the system can skip I/O if memory random access can skip part of data 
(e.g., columns)
+  * Support for IN, exists and correlated subqueries
+  * More native types - Enums, timestamp
+  * Persistent UDF's
+ 
+ === Metadata Management ===
+  * [P0] Reducing the size of the metastore – whatever it takes: some ideas 
were to not store COLUMNS etc.
+   * One idea here is to introduce something like a schema object that 
contains the column objects. Partitions would inherit the schema object, which 
would only change when the table schema changes. Work would also be needed to 
migrate the existing setup. This would greatly reduce the number of rows in 
COLUMNS (currently in the billions).
+  * [P1] View improvements
+ 
+ === Test, Error Messages and Debugging ===
-  * Heavy-duty test infrastructure
+  * [P0] Heavy-duty test infrastructure
   * Automated code coverage reports
   * Hive CLI improvement/Error messages
-  * [[http://www.dtidata.com/hard_drive_recovery.htm|Drive Recovery]]
   * HiveServer robustness
   * Debuggability / Resumability:
    * Show users the last portion of the data that caused the task to fail
    * Restart a job with a particular mapper (that failed earlier, for 
debugging purposes)
    * Resume at map-reduce job level.
-  * Support for Insert Appends
-  * Selection of 
[[http://aboutweddingreception.com/wedding-reception-flowers/| wedding 
reception flowers]]
-  * Support for IN, exists and correlated subqueries
-  * More native types - Enums, timestamp
-  * Resource about the [[http://eliteforum.org/| Elite Group]]
-  * Persistent UDF's
-  * Cost-based Optimization
   * SQL/OLAP
   * Storage handler improvements
   * System views
   * JDBC/ODBC improvements
   * mapred -> mapreduce transition
  
- See Also:
- [[http://www.abbreviationfor.com/| Abbreviations]]
-
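The "[P0] Optimize JOINs using Bloom Filters" item above does not spell out a design. As a rough illustration only, the sketch below assumes one common approach: build a Bloom filter from the join keys of one input and use it to discard rows of the other input that cannot match, before the actual join. The names `BloomFilter` and `bloom_join`, the filter size, and the hash scheme are all hypothetical and are not taken from Hive; this is a single-process sketch, not a MapReduce implementation.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative defaults, not tuned values."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        data = str(key).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def bloom_join(left_rows, right_rows):
    """Inner-join (key, value) rows, pre-filtering the right side.

    Returns the joined rows and the number of right-side rows that
    survived the Bloom filter (the rows that would still be shuffled).
    """
    bloom = BloomFilter()
    left_by_key = {}
    for key, value in left_rows:
        bloom.add(key)
        left_by_key.setdefault(key, []).append(value)

    # Rows rejected here can never match, so they need not be transferred.
    survivors = [(k, v) for k, v in right_rows if bloom.might_contain(k)]

    # False positives are removed by the exact key lookup in the real join.
    joined = [
        (k, lv, rv)
        for k, rv in survivors
        for lv in left_by_key.get(k, [])
    ]
    return joined, len(survivors)


left = [(1, "a"), (2, "b")]
right = [(2, "x"), (3, "y"), (2, "z")]
joined, shuffled = bloom_join(left, right)
```

Because a Bloom filter never reports a false negative, the join result is exact; a false positive only means that a non-matching row is carried a little further before the exact join drops it.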
