[
https://issues.apache.org/jira/browse/HIVE-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028257#comment-16028257
]
Marta Kuczora edited comment on HIVE-16784 at 5/29/17 12:02 PM:
----------------------------------------------------------------
In the LineageState.setLineage method we get the file sink operator for the
path:
{noformat}
public void setLineage(Path dir, DataContainer dc,
    List<FieldSchema> cols) {
  // First lookup the file sink operator from the load work.
  Operator<?> op = dirToFop.get(dir);
  // Go over the associated fields and look up the dependencies
  // by position in the row schema of the filesink operator.
  if (op == null) {
    return;
  }
  List<ColumnInfo> signature = op.getSchema().getSignature();
  int i = 0;
  for (FieldSchema fs : cols) {
    linfo.putDependency(dc, fs, index.getDependency(op, signature.get(i++)));
  }
}
{noformat}
The reason why the lineage information is missing from the out file is that the
dirToFop map doesn't contain the given path.
This map is created in the SemanticAnalyzer.genFileSinkPlan method:
{noformat}
if (ltd != null && SessionState.get() != null) {
  SessionState.get().getLineageState()
      .mapDirToFop(ltd.getSourcePath(), output);
}
{noformat}
The path used here doesn't match the path used in the
LineageState.setLineage method. The difference is in the file name: the map
contains the path for the file "-ext-10000", but the path in the LineageState
points to the "-ext-10002" file.
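To make the effect of the mismatch concrete, here is a minimal, self-contained sketch. It is not Hive code: the class DirToFopMismatchSketch, the use of plain strings in place of Path and FileSinkOperator, and the "/staging/" prefix are all assumptions made for illustration. Only the "-ext-10000" and "-ext-10002" names and the early return on a null lookup come from the analysis above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; this is NOT Hive's LineageState.
public class DirToFopMismatchSketch {
    // Plays the role of the dirToFop map. Strings stand in for Path and
    // FileSinkOperator to keep the sketch free of Hive/Hadoop dependencies.
    private final Map<String, String> dirToFop = new HashMap<>();

    // Corresponds to the registration done in genFileSinkPlan.
    public void mapDirToFop(String dir, String fileSinkOp) {
        dirToFop.put(dir, fileSinkOp);
    }

    // Mirrors the early return in setLineage: returns false when no operator
    // is registered for the directory, i.e. when no lineage would be recorded.
    public boolean setLineage(String dir) {
        String op = dirToFop.get(dir);
        if (op == null) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        DirToFopMismatchSketch state = new DirToFopMismatchSketch();

        // The map is populated with the "-ext-10000" path
        // (the "/staging/" prefix is made up for this example).
        state.mapDirToFop("/staging/-ext-10000", "FS_1");

        // The lookup uses the "-ext-10002" path, so it misses.
        System.out.println(state.setLineage("/staging/-ext-10002")); // false

        // A lookup with the registered path would succeed.
        System.out.println(state.setLineage("/staging/-ext-10000")); // true
    }
}
```

The lookup is an exact key match, so any difference in the last path component is enough to make setLineage return without recording anything.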
was (Author: kuczoram):
In the LineageState.setLineage method we get the file sink operator for the
path:
{noformat}
public void setLineage(Path dir, DataContainer dc,
    List<FieldSchema> cols) {
  // First lookup the file sink operator from the load work.
  Operator<?> op = dirToFop.get(dir);
  // Go over the associated fields and look up the dependencies
  // by position in the row schema of the filesink operator.
  if (op == null) {
    return;
  }
  List<ColumnInfo> signature = op.getSchema().getSignature();
  int i = 0;
  for (FieldSchema fs : cols) {
    linfo.putDependency(dc, fs, index.getDependency(op, signature.get(i++)));
  }
}
{noformat}
The reason why the lineage information is missing from the out file is that the
dirToFop map doesn't contain the given path.
This map is created in the SemanticAnalyzer.genFileSinkPlan method:
{noformat}
if (ltd != null && SessionState.get() != null) {
  SessionState.get().getLineageState()
      .mapDirToFop(ltd.getSourcePath(), (FileSinkOperator) output);
}
{noformat}
The path used here doesn't match the path used in the
LineageState.setLineage method. The difference is in the file name: the map
contains the path for the file "-ext-10000", but the path in the LineageState
points to the "-ext-10002" file.
> Missing lineage information when hive.blobstore.optimizations.enabled is true
> -----------------------------------------------------------------------------
>
> Key: HIVE-16784
> URL: https://issues.apache.org/jira/browse/HIVE-16784
> Project: Hive
> Issue Type: Bug
> Reporter: Marta Kuczora
>
> Running the commands of the add_part_multiple.q test on S3 with
> hive.blobstore.optimizations.enabled=true fails because of missing lineage
> information.
> Running the command on HDFS
> {noformat}
> from src TABLESAMPLE (1 ROWS)
> insert into table add_part_test PARTITION (ds='2010-01-01') select 100,100
> insert into table add_part_test PARTITION (ds='2010-02-01') select 200,200
> insert into table add_part_test PARTITION (ds='2010-03-01') select 400,300
> insert into table add_part_test PARTITION (ds='2010-04-01') select 500,400;
> {noformat}
> results in the following posthook outputs
> {noformat}
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-01-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-01-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-02-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-02-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-03-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-03-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-04-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-04-01).value EXPRESSION []
> {noformat}
> These lines are not printed when running the command on the table located in
> S3.
> If hive.blobstore.optimizations.enabled=false, the lineage information is
> printed.