Hello,
Tested with Pig 0.12.1 and Pig 0.14.0
I'm writing here without much hope, but maybe I'll get lucky and someone
knows how to solve this :)
I am writing a Storage for Gora, and if I reference an outer bag inside a
FOREACH, I get a java.lang.StackOverflowError when storing.
Exactly this:
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. null
java.lang.StackOverflowError
    at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at java.lang.Class.newInstance(Class.java:379)
    at org.apache.pig.impl.util.Utils.mergeCollection(Utils.java:441)
    at org.apache.pig.newplan.DependencyOrderWalker.doAllPredecessors(DependencyOrderWalker.java:84)
    at org.apache.pig.newplan.DependencyOrderWalker.doAllPredecessors(DependencyOrderWalker.java:88)
    at org.apache.pig.newplan.DependencyOrderWalker.doAllPredecessors(DependencyOrderWalker.java:88)
(this last line repeats for about 1030 more lines of the log)
With DUMP or with PigStorage everything works perfectly, so the problem
is surely in my Storage implementation.
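The repeated doAllPredecessors frame looks like unbounded recursion while walking predecessors in the plan. I have not confirmed the cause. As a general illustration only, here is a minimal Java sketch of how a recursive predecessor walk can overflow the stack when the graph contains a cycle. The class, method, and node names are hypothetical, and this is not Pig's actual code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, NOT Pig's implementation.
public class CycleWalkSketch {
    // Hypothetical graph: node name -> names of its predecessors.
    public static final Map<String, List<String>> preds = new HashMap<>();

    // Post-order walk: a node is marked as seen only AFTER all of its
    // predecessors have been visited, so a cycle is never cut short.
    public static void walk(String node, Set<String> seen, List<String> order) {
        if (seen.contains(node)) {
            return;
        }
        for (String p : preds.getOrDefault(node, Collections.<String>emptyList())) {
            walk(p, seen, order);
        }
        seen.add(node);
        order.add(node);
    }

    // Returns true if walking from 'start' ends in a StackOverflowError.
    public static boolean overflows(String start) {
        try {
            walk(start, new HashSet<String>(), new ArrayList<String>());
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Acyclic chain: c depends on b, b depends on a.
        preds.put("c", Arrays.asList("b"));
        preds.put("b", Arrays.asList("a"));
        System.out.println(overflows("c")); // prints false

        // Close the loop: a now depends on c, so the walk never terminates.
        preds.put("a", Arrays.asList("c"));
        System.out.println(overflows("c")); // prints true
    }
}
```

If something like this is happening here, the question becomes which dependency in my script closes the loop, and whether my Storage is what introduces it.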
The script is as follows:
borrar_areas_table = LOAD '.'
    USING org.apache.gora.pig.GoraStorage(
        'java.lang.String',
        'es.indra.innovationlabs.celtic.generated.BorrarAreas',
        'nombre') ;
borrar_areas = FOREACH borrar_areas_table GENERATE key ;
borrar_areas_bag = GROUP borrar_areas ALL ;
-- [2] - Delete from webpage:
--   experta: map <area> -> record = hashmap,
--   and areas: array <areas> = bag
webpage = LOAD '.'
    USING org.apache.gora.pig.GoraStorage(
        'java.lang.String',
        'org.apache.nutch.storage.WebPage',
        'experta, areas') ;
-- Select the pages whose <areas> contains any of the areas to delete
-- (in borrar_areas_bag.borrar_areas)
webpage_match = FILTER webpage BY
    bagContainsFB(areas, borrar_areas_bag.borrar_areas) ;
-- Delete the areas (bag) and the keys in experta (map)
webpage_fix = FOREACH webpage_match
    GENERATE key,
        deleteMapKeys(experta, borrar_areas_bag.borrar_areas) as experta,
        SUBTRACT(areas, borrar_areas_bag.borrar_areas) as areas ;
STORE webpage_fix INTO '.' USING org.apache.gora.pig.GoraStorage(
    'java.lang.String',
    'org.apache.nutch.storage.WebPage',
    'experta, areas') ;
I had to use a workaround to get things done: I avoid
borrar_areas_bag.borrar_areas and use a CROSS instead, but the execution
is noticeably slower:
borrar_areas_table = LOAD '.'
    USING org.apache.gora.pig.GoraStorage(
        'java.lang.String',
        'es.indra.innovationlabs.celtic.generated.BorrarAreas',
        'nombre') ;
borrar_areas = FOREACH borrar_areas_table GENERATE key ;
borrar_areas_bag = GROUP borrar_areas ALL ;
-- [2] - Delete from webpage:
--   experta: map <area> -> record = hashmap,
--   and areas: array <areas> = bag
webpage = LOAD '.'
    USING org.apache.gora.pig.GoraStorage(
        'java.lang.String',
        'org.apache.nutch.storage.WebPage',
        'experta, areas') ;
webpage_cross_areas = CROSS webpage, borrar_areas_bag ;
-- Select the pages whose <areas> contains any of the areas to delete
-- (in borrar_areas_bag::borrar_areas)
webpage_match = FILTER webpage_cross_areas BY
    bagContainsFB(webpage::areas, borrar_areas_bag::borrar_areas) ;
-- Delete the areas (bag) and the keys in experta (map)
webpage_fix = FOREACH webpage_match
    GENERATE webpage::key AS key,
        deleteMapKeys(experta, borrar_areas_bag::borrar_areas) as experta,
        SUBTRACT(areas, borrar_areas_bag::borrar_areas) as areas:{(chararray)} ;
STORE webpage_fix INTO '.' USING org.apache.gora.pig.GoraStorage(
    'java.lang.String',
    'org.apache.nutch.storage.WebPage',
    'experta, areas') ;
The actual question is: does this combination ring a bell for anyone?
An outer bag referenced in a FOREACH, a custom Storage, dependencies, ...
Is there any method that I should implement? Is it related to some schema?
I know it is a rather vague question, so I don't expect much :( but
thanks! :)
Regards,
Alfonso Nishikawa