[
https://issues.apache.org/jira/browse/MAPREDUCE-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
lvhu updated MAPREDUCE-7450:
----------------------------
Priority: Critical (was: Blocker)
> Set the record delimiter for the input file based on its path
> -------------------------------------------------------------
>
> Key: MAPREDUCE-7450
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7450
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: client
> Affects Versions: 3.3.6
> Environment: Any
> Reporter: lvhu
> Priority: Critical
> Fix For: MR-3902
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> In a MapReduce program, the record delimiter used when reading files can easily be set
> through the textinputformat.record.delimiter parameter.
> The parameter is just as easy to set from other frameworks such as Spark, for example:
> spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter",
> "|@|")
> val rdd = spark.sparkContext.newAPIHadoopFile(...)
> But once textinputformat.record.delimiter is modified, it takes effect for all
> files, while in real scenarios different files often use different delimiters.
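> For reference, the same parameter can also be supplied from the command line through
> Hadoop's generic options when the driver runs via ToolRunner (a sketch; the jar name,
> driver class, and input/output paths are illustrative placeholders):

```shell
# -D sets textinputformat.record.delimiter for this job only;
# myjob.jar, MyDriver, /input and /output are placeholders.
hadoop jar myjob.jar MyDriver \
  -D textinputformat.record.delimiter='|@|' \
  /input /output
```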
> In Hive, since users cannot program against the API, the record delimiter cannot be
> changed through the methods above, and changing it in a configuration file would
> take effect on all Hive tables.
> The only way to change the record delimiter in Hive is to write a custom
> TextInputFormat class.
> The current Hive workaround is as follows:
> package abc.hive;
> public class MyFstTextInputFormat extends FileInputFormat&lt;LongWritable, Text&gt;
>   implements JobConfigurable {
>   ...
> }
> create table test (
>   id string,
>   name string
> ) stored as
> INPUTFORMAT 'abc.hive.MyFstTextInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
> If files use several different record delimiters, a separate TextInputFormat has to
> be written for each one.
> My idea is to modify the TextInputFormat class so that the record delimiter for an
> input file can be set based on a prefix of the file's path.
> Specifically, TextInputFormat would be modified along these lines:
> public class TextInputFormat extends FileInputFormat&lt;LongWritable, Text&gt;
>   implements JobConfigurable {
>   ....
>   public RecordReader&lt;LongWritable, Text&gt; getRecordReader(
>       InputSplit genericSplit, JobConf job, Reporter reporter)
>       throws IOException {
>
>     reporter.setStatus(genericSplit.toString());
>     // default delimiter
>     String delimiter = job.get("textinputformat.record.delimiter");
>     // obtain the path of the file backing this split
>     String filePath = ((FileSplit) genericSplit).getPath().toUri().getPath();
>     // map of path prefixes to delimiters, parsed from configuration parameters
>     Map&lt;String, String&gt; pathToDelimiterMap = ... // obtained by parsing the configuration
>     for (Map.Entry&lt;String, String&gt; entry : pathToDelimiterMap.entrySet()) {
>       String configPath = entry.getKey();
>       // if configPath is a prefix of filePath, use its delimiter
>       if (filePath.startsWith(configPath)) {
>         delimiter = entry.getValue();
>       }
>     }
>     byte[] recordDelimiterBytes = null;
>     if (null != delimiter) {
>       recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
>     }
>     return new LineRecordReader(job, (FileSplit) genericSplit,
>         recordDelimiterBytes);
>   }
> }
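> The prefix lookup itself is plain Java and can be sketched independently of Hadoop.
> In the sketch below, the "prefix=delimiter;..." mapping syntax and the parseMapping
> helper are my assumptions for illustration, not an existing Hadoop parameter format:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DelimiterByPath {

    // Parse a mapping such as "/data/a=|@|;/data/b=##" into path-prefix -> delimiter.
    // The "prefix=delimiter;..." syntax is hypothetical, chosen only for this sketch.
    static Map<String, String> parseMapping(String value) {
        Map<String, String> map = new LinkedHashMap<>();
        if (value == null || value.isEmpty()) {
            return map;
        }
        for (String pair : value.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                map.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return map;
    }

    // Return the delimiter whose configured prefix matches the file path,
    // falling back to the default when no prefix matches.
    static String delimiterFor(String filePath, Map<String, String> mapping,
                               String defaultDelimiter) {
        String delimiter = defaultDelimiter;
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            if (filePath.startsWith(entry.getKey())) {
                delimiter = entry.getValue();
            }
        }
        return delimiter;
    }

    public static void main(String[] args) {
        Map<String, String> mapping =
            parseMapping("/warehouse/t1=|@|;/warehouse/t2=##");
        // A file under /warehouse/t1 picks up the |@| delimiter.
        System.out.println(delimiterFor("/warehouse/t1/part-00000", mapping, "\n"));
    }
}
```

> Inside getRecordReader this lookup would replace the hard-coded single delimiter,
> with the mapping read once from the job configuration.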
> With support for setting the record delimiter of an input file according to its
> path, changing a delimiter no longer requires writing code, and it is also very
> convenient for Hadoop and Spark, since parameters no longer have to be reconfigured
> for each file.
> If you accept my idea, I hope you can assign the task to me. My GitHub account is
> lvhu-goodluck.
> I really hope to contribute code to the community.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]