[
https://issues.apache.org/jira/browse/MAPREDUCE-7450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
lvhu updated MAPREDUCE-7450:
----------------------------
Priority: Critical (was: Blocker)
> Set the record delimiter for the input file based on its path
> -------------------------------------------------------------
>
> Key: MAPREDUCE-7450
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7450
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: client
> Affects Versions: 3.3.6
> Environment: Any
> Reporter: lvhu
> Priority: Critical
> Fix For: MR-3902
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> In a MapReduce program, the record delimiter used when reading files can easily be set
> through the textinputformat.record.delimiter parameter.
> The parameter is just as easy to set from other frameworks such as Spark, for example:
> spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter",
> "|@|")
> val rdd = spark.sparkContext.newAPIHadoopFile(...)
> But once textinputformat.record.delimiter is modified, it takes effect for all
> files, while in real scenarios different files often use different delimiters.
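> For reference, the same parameter can also be supplied from the command line through
> Hadoop's generic options when the driver runs via ToolRunner (a sketch; the jar name,
> driver class, and input/output paths are illustrative placeholders):

```shell
# -D sets textinputformat.record.delimiter for this job only;
# myjob.jar, MyDriver, /input and /output are placeholders.
hadoop jar myjob.jar MyDriver \
  -D textinputformat.record.delimiter='|@|' \
  /input /output
```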
> In Hive, since users cannot program against the API, the record delimiter cannot be
> changed through the methods above, and changing it in a configuration file would
> take effect on all Hive tables.
> The only way to change the record delimiter in Hive is to write a custom
> TextInputFormat class.
> The current Hive workaround is as follows:
> package abc.hive;
> public class MyFstTextInputFormat extends FileInputFormat&lt;LongWritable, Text&gt;
>   implements JobConfigurable {
>   ...
> }
> create table test (
>   id string,
>   name string
> ) stored as
> INPUTFORMAT 'abc.hive.MyFstTextInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
> If files use several different record delimiters, a separate TextInputFormat has to
> be written for each one.
> My idea is to modify the TextInputFormat class so that the record delimiter for an
> input file can be set based on a prefix of the file's path.
> Specifically, TextInputFormat would be modified along these lines:
> public class TextInputFormat extends FileInputFormat&lt;LongWritable, Text&gt;
>   implements JobConfigurable {
>   ....
>   public RecordReader&lt;LongWritable, Text&gt; getRecordReader(
>       InputSplit genericSplit, JobConf job, Reporter reporter)
>       throws IOException {
>
>     reporter.setStatus(genericSplit.toString());
>     // default delimiter
>     String delimiter = job.get("textinputformat.record.delimiter");
>     // obtain the path of the file backing this split
>     String filePath = ((FileSplit) genericSplit).getPath().toUri().getPath();
>     // map of path prefixes to delimiters, parsed from configuration parameters
>     Map&lt;String, String&gt; pathToDelimiterMap = ... // obtained by parsing the configuration
>     for (Map.Entry&lt;String, String&gt; entry : pathToDelimiterMap.entrySet()) {
>       String configPath = entry.getKey();
>       // if configPath is a prefix of filePath, use its delimiter
>       if (filePath.startsWith(configPath)) {
>         delimiter = entry.getValue();
>       }
>     }
>     byte[] recordDelimiterBytes = null;
>     if (null != delimiter) {
>       recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
>     }
>     return new LineRecordReader(job, (FileSplit) genericSplit,
>         recordDelimiterBytes);
>   }
> }
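> The prefix lookup itself is plain Java and can be sketched independently of Hadoop.
> In the sketch below, the "prefix=delimiter;..." mapping syntax and the parseMapping
> helper are my assumptions for illustration, not an existing Hadoop parameter format:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DelimiterByPath {

    // Parse a mapping such as "/data/a=|@|;/data/b=##" into path-prefix -> delimiter.
    // The "prefix=delimiter;..." syntax is hypothetical, chosen only for this sketch.
    static Map<String, String> parseMapping(String value) {
        Map<String, String> map = new LinkedHashMap<>();
        if (value == null || value.isEmpty()) {
            return map;
        }
        for (String pair : value.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                map.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return map;
    }

    // Return the delimiter whose configured prefix matches the file path,
    // falling back to the default when no prefix matches.
    static String delimiterFor(String filePath, Map<String, String> mapping,
                               String defaultDelimiter) {
        String delimiter = defaultDelimiter;
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            if (filePath.startsWith(entry.getKey())) {
                delimiter = entry.getValue();
            }
        }
        return delimiter;
    }

    public static void main(String[] args) {
        Map<String, String> mapping =
            parseMapping("/warehouse/t1=|@|;/warehouse/t2=##");
        // A file under /warehouse/t1 picks up the |@| delimiter.
        System.out.println(delimiterFor("/warehouse/t1/part-00000", mapping, "\n"));
    }
}
```

> Inside getRecordReader this lookup would replace the hard-coded single delimiter,
> with the mapping read once from the job configuration.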
> With support for setting the record delimiter of an input file according to its
> path, changing a delimiter no longer requires writing code, and it is also very
> convenient for Hadoop and Spark, since parameters no longer have to be reconfigured
> for each file.
> If you accept my idea, I hope you can assign the task to me. My GitHub account is
> lvhu-goodluck.
> I really hope to contribute code to the community.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]