[ https://issues.apache.org/jira/browse/PIG-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jacob Tolar updated PIG-5432:
-----------------------------
    Description:
OrcStorage needs to detect the schema of input data paths. If some data paths have no ORC files (perhaps only a _SUCCESS marker is present), this will fail.

For example:
{code}
A = LOAD '/path/to/20230101,/path/to/20230102' USING OrcStorage();
{code}

If {{/path/to/20230101}} contains only a _SUCCESS marker and {{20230102}} contains data, OrcStorage fails to detect the schema.

The code tries to use a search algorithm to recursively search through all input paths for the data (via Utils.depthFirstSearchForFile), but it is implemented incorrectly and returns early in this scenario.

See:
https://github.com/apache/pig/blob/c0d75ba930f9aa5c6454d0264a96f82b45279202/src/org/apache/pig/builtin/OrcStorage.java#L389-L408
https://github.com/apache/pig/blob/59ec4a326079c9f937a052194405415b1e3a2b06/src/org/apache/pig/impl/util/Utils.java#L629-L667

I'll attach a patch.

  was:
OrcStorage needs to detect the schema of input data paths. If some data paths have no files this will fail.

For example:
{code}
A = LOAD '/path/to/20230101,/path/to/20230102' USING OrcStorage();
{code}

If {{/path/to/20230101}} contains only a _SUCCESS marker and {{20230102}} contains data, OrcStorage fails to detect the schema.

The code tries to use a search algorithm to recursively search through all input paths for the data (via Utils.depthFirstSearchForFile), but it is implemented incorrectly and returns early in this scenario.

See:
https://github.com/apache/pig/blob/c0d75ba930f9aa5c6454d0264a96f82b45279202/src/org/apache/pig/builtin/OrcStorage.java#L389-L408
https://github.com/apache/pig/blob/59ec4a326079c9f937a052194405415b1e3a2b06/src/org/apache/pig/impl/util/Utils.java#L629-L667

I'll attach a patch.

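The early-return problem described above can be illustrated with a minimal Java sketch. This is not the Pig code and not the attached patch: it uses java.nio.file on a local filesystem instead of Hadoop's FileSystem, and the names SchemaFileSearch, findFirstDataFile, depthFirstSearch and isVisible are hypothetical. It also assumes that files whose names start with "_" or "." (such as _SUCCESS) should be skipped. The point it shows is that the search stops only when a data file is found; an input path or subdirectory without data makes it move on to the next candidate rather than return.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only (not Pig's Utils class).
final class SchemaFileSearch {

    // Try every input path in order; an input path without data is skipped,
    // it does not end the search.
    static Optional<Path> findFirstDataFile(List<Path> inputPaths,
                                            Predicate<Path> isDataFile) throws IOException {
        for (Path root : inputPaths) {
            Optional<Path> found = depthFirstSearch(root, isDataFile);
            if (found.isPresent()) {
                return found;
            }
            // Nothing usable under this root: continue with the next one.
        }
        return Optional.empty();
    }

    // Depth-first search below a single path. Returns the first regular file
    // accepted by the filter, or empty if the subtree contains none.
    static Optional<Path> depthFirstSearch(Path path,
                                           Predicate<Path> isDataFile) throws IOException {
        if (Files.isRegularFile(path)) {
            return isDataFile.test(path) ? Optional.of(path) : Optional.empty();
        }
        if (!Files.isDirectory(path)) {
            return Optional.empty();
        }
        List<Path> children;
        try (Stream<Path> listing = Files.list(path)) {
            // Sorted so the traversal order is deterministic.
            children = listing.sorted().collect(Collectors.toList());
        }
        for (Path child : children) {
            Optional<Path> found = depthFirstSearch(child, isDataFile);
            if (found.isPresent()) {
                return found;
            }
            // Keep looking at the remaining siblings instead of returning here.
        }
        return Optional.empty();
    }

    // Assumed filter: treat names starting with '_' or '.' as non-data files.
    static boolean isVisible(Path file) {
        String name = file.getFileName().toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }
}
```

With inputs corresponding to /path/to/20230101 (only a _SUCCESS marker) and /path/to/20230102 (data), findFirstDataFile in this sketch returns a file under the second path instead of giving up after the first.
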
> OrcStorage fails to detect schema in some cases
> -----------------------------------------------
>
>                 Key: PIG-5432
>                 URL: https://issues.apache.org/jira/browse/PIG-5432
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Jacob Tolar
>            Priority: Minor
>         Attachments: PIG-5432.v01.patch
>
>
> OrcStorage needs to detect the schema of input data paths. If some data paths
> have no ORC files (perhaps only a _SUCCESS marker is present), this will
> fail.
> For example:
> {code}
> A = LOAD '/path/to/20230101,/path/to/20230102' USING OrcStorage();
> {code}
> If {{/path/to/20230101}} contains only a _SUCCESS marker and {{20230102}}
> contains data, OrcStorage fails to detect the schema.
> The code tries to use a search algorithm to recursively search through all
> input paths for the data (via Utils.depthFirstSearchForFile), but it is
> implemented incorrectly and returns early in this scenario.
> See:
> https://github.com/apache/pig/blob/c0d75ba930f9aa5c6454d0264a96f82b45279202/src/org/apache/pig/builtin/OrcStorage.java#L389-L408
> https://github.com/apache/pig/blob/59ec4a326079c9f937a052194405415b1e3a2b06/src/org/apache/pig/impl/util/Utils.java#L629-L667
> I'll attach a patch.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)