Hi Shuja, the answer seems to be in these lines:

Caused by: java.io.IOException: Cannot run program "sampleMapper.groovy":
java.io.IOException: error=2, No such file or directory

Hadoop can't see this file or can't run it.

1. Make sure you added the file correctly.
2. Check that Hadoop can run the script on your Hadoop machines.

Can you run this script directly from a console on a Hadoop machine, like

    > sampleMapper.groovy

or do you run it as

    > groovy sampleMapper.groovy

Maybe you should specify that groovy is needed to run your script. Try changing your select into:

    " ... USING 'groovy sampleMapper.groovy' ... "
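For context on why that clause matters: Hive's ScriptOperator hands the USING string to a ProcessBuilder on each task node, so 'sampleMapper.groovy' must resolve to something the node can exec. A minimal sketch of the interpreter route, assuming the groovy binary is on the PATH of every task node (the table, column, and script names are the ones used later in this thread):

    hive> ADD FILE sampleMapper.groovy;
    hive> INSERT OVERWRITE TABLE test_new
        > SELECT TRANSFORM (xmlfile)
        > USING 'groovy sampleMapper.groovy'
        > AS (b, c)
        > FROM test;

The other route is a self-executing script: give it a shebang as its first line and set the execute bit before adding it, then keep USING 'sampleMapper.groovy' as is. Whether the execute bit survives into the distributed cache is worth verifying on a task node.

    #!/usr/bin/env groovy
    // first line of sampleMapper.groovy; then run: chmod +x sampleMapper.groovy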
On 10 June 2010 14:01, Shuja Rehman <shujamug...@gmail.com> wrote:
> and on the link
>
> http://localhost:50030/jobfailures.jsp?jobid=job_201006101118_0009&kind=map&cause=failed
>
> I have found this output:
>
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
> Hive Runtime Error while processing row {"xmlfile":"<xy><a><b>Hello</b><c>world</c></a></xy>"}
>     at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:171)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
> Error while processing row {"xmlfile":"<xy><a><b>Hello</b><c>world</c></a></xy>"}
>     at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:417)
>     at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:153)
>     ... 4 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot
> initialize ScriptOperator
>     at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:319)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:456)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:696)
>     at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:456)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:696)
>     at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:45)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:456)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:696)
>     at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:400)
>     ... 5 more
> Caused by: java.io.IOException: Cannot run program "sampleMapper.groovy":
> java.io.IOException: error=2, No such file or directory
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
>     at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:279)
>     ... 14 more
> Caused by: java.io.IOException: java.io.IOException: error=2, No such file or
> directory
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>     ... 15 more
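Before re-running, it is worth confirming the two preconditions the trace points at ("Cannot initialize ScriptOperator" failing inside ProcessBuilder.start): that the script is registered as a job resource, and that a worker node can actually execute it. A sketch (the absolute path here is only illustrative):

    hive> ADD FILE /root/scripts/sampleMapper.groovy;
    hive> LIST FILES;

LIST FILES should echo back the added path; on a worker node, `which groovy` must also succeed for the 'groovy sampleMapper.groovy' form to work.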
On Thu, Jun 10, 2010 at 1:57 PM, Shuja Rehman <shujamug...@gmail.com> wrote:
>> I have changed the logging level according to this command:
>>
>> bin/hive -hiveconf hive.root.logger=INFO,console
>>
>> and the output is
>>
>> ------------------------------------------------------------------------------
>> 10/06/10 13:51:20 INFO parse.ParseDriver: Parsing command: INSERT OVERWRITE TABLE test_new
>> SELECT
>> TRANSFORM (xmlfile)
>> USING 'sampleMapper.groovy'
>> AS (b,c)
>> FROM test
>> 10/06/10 13:51:20 INFO parse.ParseDriver: Parse Completed
>> 10/06/10 13:51:20 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
>> 10/06/10 13:51:20 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
>> 10/06/10 13:51:20 INFO parse.SemanticAnalyzer: Get metadata for source tables
>> 10/06/10 13:51:20 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>> 10/06/10 13:51:20 INFO metastore.ObjectStore: ObjectStore, initialize called
>> 10/06/10 13:51:20 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
>> 10/06/10 13:51:20 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
>> 10/06/10 13:51:20 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
>> 10/06/10 13:51:22 INFO metastore.ObjectStore: Initialized ObjectStore
>> 10/06/10 13:51:22 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=test
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test { string xmlfile}
>> 10/06/10 13:51:23 INFO parse.SemanticAnalyzer: Get metadata for subqueries
>> 10/06/10 13:51:23 INFO parse.SemanticAnalyzer: Get metadata for destination tables
>> 10/06/10 13:51:23 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=test_new
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test_new { string b, string c}
>> 10/06/10 13:51:23 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test_new { string b, string c}
>> 10/06/10 13:51:23 INFO ppd.OpProcFactory: Processing for FS(3)
>> 10/06/10 13:51:23 INFO ppd.OpProcFactory: Processing for SCR(2)
>> 10/06/10 13:51:23 INFO ppd.OpProcFactory: Processing for SEL(1)
>> 10/06/10 13:51:23 INFO ppd.OpProcFactory: Processing for TS(0)
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test { string xmlfile}
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test { string xmlfile}
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test { string xmlfile}
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test { string xmlfile}
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test { string xmlfile}
>> 10/06/10 13:51:23 INFO hive.log: DDL: struct test { string xmlfile}
>> 10/06/10 13:51:23 INFO parse.SemanticAnalyzer: Completed plan generation
>> 10/06/10 13:51:23 INFO ql.Driver: Semantic Analysis Completed
>> 10/06/10 13:51:23 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:b, type:string, comment:null), FieldSchema(name:c, type:string, comment:null)], properties:null)
>> 10/06/10 13:51:23 INFO ql.Driver: query plan = file:/tmp/root/hive_2010-06-10_13-51-20_112_5091815325633732890/queryplan.xml
>> 10/06/10 13:51:24 INFO ql.Driver: Starting command: INSERT OVERWRITE TABLE test_new
>> SELECT
>> TRANSFORM (xmlfile)
>> USING 'sampleMapper.groovy'
>> AS (b,c)
>> FROM test
>> Total MapReduce jobs = 2
>> 10/06/10 13:51:24 INFO ql.Driver: Total MapReduce jobs = 2
>> Launching Job 1 out of 2
>> 10/06/10 13:51:24 INFO ql.Driver: Launching Job 1 out of 2
>> Number of reduce tasks is set to 0 since there's no reduce operator
>> 10/06/10 13:51:24 INFO exec.ExecDriver: Number of reduce tasks is set to 0 since there's no reduce operator
>> 10/06/10 13:51:24 INFO exec.ExecDriver: Using org.apache.hadoop.hive.ql.io.HiveInputFormat
>> 10/06/10 13:51:24 INFO exec.ExecDriver: Processing alias test
>> 10/06/10 13:51:24 INFO exec.ExecDriver: Adding input file hdfs://localhost:9000/user/hive/warehouse/test
>> 10/06/10 13:51:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>> 10/06/10 13:51:24 INFO mapred.FileInputFormat: Total input paths to process : 1
>> Starting Job = job_201006101118_0009, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201006101118_0009
>> 10/06/10 13:51:25 INFO exec.ExecDriver: Starting Job = job_201006101118_0009, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201006101118_0009
>> Kill Command = /usr/local/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201006101118_0009
>> 10/06/10 13:51:25 INFO exec.ExecDriver: Kill Command = /usr/local/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201006101118_0009
>> 2010-06-10 13:51:32,255 Stage-1 map = 0%, reduce = 0%
>> 10/06/10 13:51:32 INFO exec.ExecDriver: 2010-06-10 13:51:32,255 Stage-1 map = 0%, reduce = 0%
>> 2010-06-10 13:51:35,305 Stage-1 map = 50%, reduce = 0%
>> 10/06/10 13:51:35 INFO exec.ExecDriver: 2010-06-10 13:51:35,305 Stage-1 map = 50%, reduce = 0%
>> 2010-06-10 13:51:58,505 Stage-1 map = 100%, reduce = 100%
>> 10/06/10 13:51:58 INFO exec.ExecDriver: 2010-06-10 13:51:58,505 Stage-1 map = 100%, reduce = 100%
>> Ended Job = job_201006101118_0009 with errors
>> 10/06/10 13:51:58 ERROR exec.ExecDriver: Ended Job = job_201006101118_0009 with errors
>>
>> Task with the most failures(4):
>> -----
>> Task ID:
>>   task_201006101118_0009_m_000000
>>
>> URL:
>>   http://localhost:50030/taskdetails.jsp?jobid=job_201006101118_0009&tipid=task_201006101118_0009_m_000000
>> -----
>>
>> 10/06/10 13:51:58 ERROR exec.ExecDriver:
>> Task with the most failures(4):
>> -----
>> Task ID:
>>   task_201006101118_0009_m_000000
>>
>> URL:
>>   http://localhost:50030/taskdetails.jsp?jobid=job_201006101118_0009&tipid=task_201006101118_0009_m_000000
>> -----
>>
>> FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver
>> 10/06/10 13:51:58 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver
>> ------------------------------------------------------------------------------
>>
>> Any clue???
>>
>> On Thu, Jun 10, 2010 at 1:43 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>> Can you try changing your logging level to debug and see the exact
>>> error message in hive.log?
>>>
>>> Thanks and Regards,
>>> Sonal
>>> www.meghsoft.com
>>> http://in.linkedin.com/in/sonalgoyal
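For what it's worth, Sonal's suggestion can be applied with the same switch Shuja used above, just at DEBUG level; a sketch, assuming the stock hive-log4j.properties (whose DRFA file appender writes to /tmp/<user>/hive.log by default):

    $ bin/hive -hiveconf hive.root.logger=DEBUG,console   # debug straight to the console
    $ bin/hive -hiveconf hive.root.logger=DEBUG,DRFA      # debug into hive.log instead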
>>> On Thu, Jun 10, 2010 at 5:07 PM, Shuja Rehman <shujamug...@gmail.com> wrote:
>>> > Hi
>>> > I have tried to do as you described. Let me explain in steps:
>>> >
>>> > 1- create table test (xmlFile String);
>>> > ----------------------------------------------------------------------------------
>>> > 2- LOAD DATA LOCAL INPATH '1.xml'
>>> >    OVERWRITE INTO TABLE test;
>>> > ----------------------------------------------------------------------------------
>>> > 3- CREATE TABLE test_new (
>>> >      b STRING,
>>> >      c STRING
>>> >    )
>>> >    ROW FORMAT DELIMITED
>>> >    FIELDS TERMINATED BY '\t';
>>> > ----------------------------------------------------------------------------------
>>> > 4- add FILE sampleMapper.groovy;
>>> > ----------------------------------------------------------------------------------
>>> > 5- INSERT OVERWRITE TABLE test_new
>>> >    SELECT
>>> >      TRANSFORM (xmlfile)
>>> >      USING 'sampleMapper.groovy'
>>> >      AS (b,c)
>>> >    FROM test;
>>> > ----------------------------------------------------------------------------------
>>> > XML FILE:
>>> > The xml file has only one row for testing purposes, which is
>>> >
>>> > <xy><a><b>Hello</b><c>world</c></a></xy>
>>> > ----------------------------------------------------------------------------------
>>> > MAPPER:
>>> > I have written the mapper in Groovy to parse it. The mapper is
>>> >
>>> > def xmlData = ""
>>> > System.in.withReader {
>>> >     xmlData = xmlData + it.readLine()
>>> > }
>>> >
>>> > def xy = new XmlParser().parseText(xmlData)
>>> > def b = xy.a.b.text()
>>> > def c = xy.a.c.text()
>>> > println([b, c].join('\t'))
>>> > ----------------------------------------------------------------------------------
>>> > Now steps 1-4 are fine, but when I perform step 5, which loads the data
>>> > from the test table into the new table using the mapper, it throws an error.
>>> > The error on the console is
>>> >
>>> > FAILED: Execution Error, return code 2 from
>>> > org.apache.hadoop.hive.ql.exec.ExecDriver
>>> >
>>> > I am facing a hard time. Any suggestions?
>>> > Thanks
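A note on the Groovy mapper quoted above: it reads only the first line of stdin, which works for the one-line test file but silently drops any further rows, since Hive streams every row of the source table to the transform script as one line on stdin. A slightly more defensive sketch (not from the thread; it also carries the shebang mentioned earlier, so it can be exec'd directly):

    #!/usr/bin/env groovy
    // Hive streams one row (here: one XML document) per input line.
    // Parse each line and emit the two extracted fields tab-separated,
    // so they land in columns b and c of test_new.
    System.in.eachLine { line ->
        def xy = new XmlParser().parseText(line)
        println([xy.a.b.text(), xy.a.c.text()].join('\t'))
    }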
>>> > On Thu, Jun 10, 2010 at 3:05 AM, Ashish Thusoo <athu...@facebook.com> wrote:
>>> >> You could load this whole xml file into a table with a single row and a
>>> >> single column. The default record delimiter is \n, but you can create a
>>> >> table where the record delimiter is \001. Once you do that, you can follow
>>> >> the approach that you described below. Will this solve your problem?
>>> >>
>>> >> Ashish
>>> >> ________________________________
>>> >> From: Shuja Rehman [mailto:shujamug...@gmail.com]
>>> >> Sent: Wednesday, June 09, 2010 3:07 PM
>>> >> To: hive-user@hadoop.apache.org
>>> >> Subject: Load data from xml using Mapper.py in hive
>>> >>
>>> >> Hi
>>> >> I have created a table in hive (suppose table1, with two columns, col1 and col2).
>>> >>
>>> >> Now I have an xml file, for which I have written a Python script that reads
>>> >> the xml file and transforms it into single rows, tab-separated.
>>> >> E.g. the output of the Python script can be
>>> >>
>>> >> row 1 = val1  val2
>>> >> row 2 = val3  val4
>>> >>
>>> >> So the output file has straight rows, with the help of the Python script.
>>> >> Now I want to load this into the created table. I have seen the example in
>>> >> which the data is first loaded into the u_data table and then transformed
>>> >> using a Python script into u_data_new, but that does not fit my scenario,
>>> >> as I have an xml file as the source.
>>> >>
>>> >> Kindly let me know, can I achieve this?
>>> >> Thanks
>>> >
>>> > --
>>> > Regards
>>> > Baig

> --
> Regards
> Shuja-ur-Rehman Baig
> _________________________________
> MS CS - School of Science and Engineering
> Lahore University of Management Sciences (LUMS)
> Sector U, DHA, Lahore, 54792, Pakistan
> Cell: +92 3214207445
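A closing note on Ashish's single-row suggestion: if changing the record delimiter is awkward on this Hive version, the same effect (the whole XML document in one row) can be had by stripping newlines before loading. A sketch reusing the file from step 2 above (the flattened filename is illustrative):

    $ tr -d '\n' < 1.xml > 1-flat.xml
    hive> LOAD DATA LOCAL INPATH '1-flat.xml' OVERWRITE INTO TABLE test;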