-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15142/#review27948
-----------------------------------------------------------

Ship it!


+1, this is awesome work, Maja. Jobs will now fail faster when there are metastore issues, and metastore accesses are cut back as well. Yay!


giraph-hive/src/main/java/org/apache/giraph/hive/HiveGiraphRunner.java
<https://reviews.apache.org/r/15142/#comment54396>

    Maybe worth adding a top level comment for this method that says something like:

    For all Hive vertex inputs, add the user settings to the configuration. Additionally, this checks the input specs for every input, which caches the metadata in the configuration to eliminate worker access to the metastore and to fail earlier when the metadata doesn't exist. In the case of multiple vertex input descriptions, metadata is cached in each vertex input format description and then saved into a single Configuration via JSON.


giraph-hive/src/main/java/org/apache/giraph/hive/HiveGiraphRunner.java
<https://reviews.apache.org/r/15142/#comment54399>

    Maybe worth adding a top level comment for this method that says something like:

    For all Hive edge inputs, add the user settings to the configuration. Additionally, this checks the input specs for every input, which caches the metadata in the configuration to eliminate worker access to the metastore and to fail earlier when the metadata doesn't exist. In the case of multiple edge input descriptions, metadata is cached in each edge input format description and then saved into a single Configuration via JSON.


- Avery Ching


On Oct. 31, 2013, 6:43 p.m., Maja Kabiljo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15142/
> -----------------------------------------------------------
> 
> (Updated Oct. 31, 2013, 6:43 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Bugs: GIRAPH-789
>     https://issues.apache.org/jira/browse/GIRAPH-789
> 
> 
> Repository: giraph-git
> 
> 
> Description
> -------
> 
> Currently each worker sends multiple requests to the metastore to get info
> about io formats, which is unnecessary and can cause issues when the
> metastore is having problems.
> 
> Hive-io is changed so that it doesn't access the metastore when schema/table
> info is already present in the Configuration, and HiveGiraphRunner now
> initializes all the formats to fill up the Configuration. If HiveGiraphRunner
> is not used, everything will still work, but workers will access the
> metastore.
> 
> 
> Diffs
> -----
> 
>   giraph-hive/src/main/java/org/apache/giraph/hive/HiveGiraphRunner.java 6b8a8e9 
>   giraph-hive/src/main/java/org/apache/giraph/hive/common/HiveUtils.java b809413 
>   giraph-hive/src/main/java/org/apache/giraph/hive/input/edge/HiveEdgeInputFormat.java 534a773 
>   giraph-hive/src/main/java/org/apache/giraph/hive/input/vertex/HiveVertexInputFormat.java d5c1279 
>   giraph-hive/src/main/java/org/apache/giraph/hive/output/HiveVertexOutputFormat.java c4813fb 
>   pom.xml f2981ff 
> 
> Diff: https://reviews.apache.org/r/15142/diff/
> 
> 
> Testing
> -------
> 
> mvn clean verify
> 
> Ran jobs with single and multiple input formats, with added logging for each
> metastore call in hive-io. For example, with a single vertex input, edge
> input and output, each worker now makes no metastore calls instead of 8.
> The number of calls from the master is also reduced: we only get the input
> partition descriptions at the beginning of the job and make no calls at
> the end (for output). The only call left at the end is from the cleanup task,
> to register the new partition. The cleanup task used to make two additional
> calls, which are also removed.
> 
> 
> Thanks,
> 
> Maja Kabiljo
> 
>
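
For readers following the thread, the caching idea described above can be illustrated with a minimal sketch. This is not the actual hive-io or HiveGiraphRunner code: a plain `Map` stands in for the Hadoop `Configuration`, and the class name, the key prefix and the `fetchFromMetastore` helper are all hypothetical. It only shows the pattern of looking in the configuration first and contacting the metastore only on a miss.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of "check the configuration before the metastore". */
class MetadataCacheSketch {
  /** Hypothetical key prefix; the real configuration keys may differ. */
  static final String KEY_PREFIX = "sketch.hive.schema.";

  /** Counts simulated metastore round trips. */
  static int metastoreCalls = 0;

  /** Stand-in for a metastore lookup; returns a fake serialized schema. */
  static String fetchFromMetastore(String table) {
    metastoreCalls++;
    return "{\"table\":\"" + table + "\"}";
  }

  /** Returns the cached schema if present, else fetches and caches it. */
  static String schemaFor(Map<String, String> conf, String table) {
    return conf.computeIfAbsent(KEY_PREFIX + table,
        key -> fetchFromMetastore(table));
  }

  public static void main(String[] args) {
    // The runner initializes every format up front, filling the configuration.
    Map<String, String> conf = new HashMap<>();
    schemaFor(conf, "vertices");
    schemaFor(conf, "edges");
    int runnerCalls = metastoreCalls;

    // A worker receives a copy of the configuration and finds everything cached.
    Map<String, String> workerConf = new HashMap<>(conf);
    schemaFor(workerConf, "vertices");
    schemaFor(workerConf, "edges");

    System.out.println("metastore calls from runner: " + runnerCalls);
    System.out.println("metastore calls from worker: "
        + (metastoreCalls - runnerCalls));
  }
}
```

In this sketch the runner makes two simulated metastore calls and the worker makes none. If the runner step is skipped, the worker lookups still succeed but each one falls through to the metastore, which mirrors the fallback behaviour described in the review request.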
