Hi Zhiwei, Max,

I just ran names_and_entities.pig locally and got the same error you had, on a dump with 1000 articles. I fixed the error by doing two things:
1) added this line to the script:

   SET mapred.child.java.opts '-Xmx2048m'

2) _commented out_ this line:

   --set io.sort.mb 1024

I think the second point is what fixed it. My guess is that this setting causes the heap space error when you're running locally.

Another note: you'll probably want to maintain separate .params files for local tests and for running on a real cluster. (Small sketches of both points are below, after the quoted thread.)

I'm attaching part of the output from names_and_entities.pig.

Zhiwei, please shoot me an email if you have any more issues like this. Jo and I have wrestled with quite a few of them :-), and there's obviously still a lot to fix.

Cheers,
Chris

On Mon, Apr 22, 2013 at 3:08 PM, Cai Zhiwei <[email protected]> wrote:
> Hi Max,
>
> My test Wikipedia dump file contains only two pages! My computer has 5 GB of memory and it ran out in a few seconds. Is that reasonable? I only want to see the output file and learn more about the details. I also tried the names_and_entities_low_memory.pig version, but it still failed. My test dump file is attached.
> Pig version: 0.11.1
> Operating system: Ubuntu 12.04
>
> Max, if you have time, could you please run names_and_entities.pig with this file and send the output file to me? That will help me get a better understanding of the indexing pipeline.
>
> Best regards,
> Zhiwei
>
> On Mon, Apr 22, 2013 at 8:26 PM, Max Jakob <[email protected]> wrote:
>> Hi Zhiwei, [CCing the dbp-spotlight-developers mailing list]
>>
>> On Mon, Apr 22, 2013 at 1:30 PM, Cai Zhiwei <[email protected]> wrote:
>> > I tried to index the English data set with pignlproc but got stuck on this step for a whole day. I used a very small dump file and tried every method mentioned at [1], but the problem still couldn't be solved.
>> >
>> > [1] https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/165
>>
>> Yes, this is still an important open issue. There is a version that does a lot in memory, and sometimes RAM is not enough. The other version dumps a lot to disk, where available storage becomes the limit. It happens with the English Wikipedia, so it will certainly happen with the wiki-links dataset.
>>
>> I added one comment from Pablo to the issue. Do you have experience with Pig?
>> @Chris, did you continue to try to solve this issue? Maybe you can give Zhiwei some pointers?
>>
>> Scalability issues pop up frequently when dealing with this amount of data. Solving them is probably part of the actual GSoC coding period (though I won't stop you from solving it outside of it as a cool open-source contributor ;) ).
>>
>> In order to get familiar with the indexing pipeline and the code (warm-up phase), it might be best if you build a model from a subset of the English Wikipedia, limiting the input at the beginning of the Pig Latin script or by truncating the Wikipedia dump XML.
>>
>> Cheers,
>> Max
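P.S. In case it's useful, this is roughly how the two changes above look in my local copy of the script. It's just a sketch: the comments are mine and everything else in the script is unchanged.

   -- give each map/reduce child JVM a 2 GB heap for the local run
   SET mapred.child.java.opts '-Xmx2048m';

   -- left commented out: with io.sort.mb at 1024 the local job ran out of heap
   -- set io.sort.mb 1024;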
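On the .params point, I mean something like the following. The file names and the INPUT parameter here are only made-up examples, so check the parameter names the script actually expects:

   # local.params (example name) - small test dump on the local filesystem
   INPUT=/home/zhiwei/data/enwiki-sample.xml

   # cluster.params (example name) - full dump on HDFS
   INPUT=hdfs:///user/zhiwei/enwiki-pages-articles.xml

You then pick one at launch time with Pig's -param_file option:

   pig -param_file local.params names_and_entities.pig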
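And on Max's suggestion of limiting the input at the beginning of the Pig Latin script: the simplest way I know of is a LIMIT right after the load. The relation names below are placeholders, not the ones actually used in names_and_entities.pig:

   -- assuming the dump is loaded into a relation called 'pages' (placeholder name)
   pages_small = LIMIT pages 1000;
   -- ...then refer to pages_small instead of pages in the rest of the script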
part-m-00001.bz2
Description: BZip2 compressed data
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
