Hi Zhiwei, Max,

I just ran names_and_entities.pig locally on a dump with 1000 articles
and got the same error you had. I fixed it by doing two things:

1) added this line to the script: SET mapred.child.java.opts '-Xmx2048m'
2) _commented out_ this line: --set io.sort.mb 1024
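
For reference, after both changes the relevant lines at the top of my
copy of the script read:

    SET mapred.child.java.opts '-Xmx2048m';
    --set io.sort.mb 1024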

I think the second point is what actually fixed it. My guess is that
io.sort.mb is the problem in local mode: it sets the size of the
map-side sort buffer in MB, and that buffer is allocated inside the
task's JVM heap, so asking for 1024 MB blows the default local heap.
Another note: you'll probably want to maintain separate .params files
for local tests and for runs on a real cluster (quick sketch below).
I'm attaching part of the output from names_and_entities.pig. Zhiwei,
please shoot me an email if you run into any more issues like this. Jo
and I have wrestled with quite a few of them :-), and there's obviously
still a lot to fix.
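
On the .params point, this is the kind of split I mean; the file names
here are just examples, nothing with these names is checked in:

    # local smoke test, using Pig's local mode
    pig -x local -param_file local.params names_and_entities.pig

    # real run, letting Pig submit to the cluster (MapReduce mode)
    pig -param_file cluster.params names_and_entities.pig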

Cheers,
Chris


On Mon, Apr 22, 2013 at 3:08 PM, Cai Zhiwei <[email protected]> wrote:

> Hi Max,
>
> My test Wikipedia dump file contains only two pages! My computer has
> 5 GB of memory, and it ran out in a few seconds. Is that reasonable? I
> only want to see the output file and learn more about the details. I
> also tried the names_and_entities_low_memory.pig version, but it still
> failed. My test dump file is attached.
> Pig version: 0.11.1
> Operating system: Ubuntu 12.04
>
> Max, if you have time, could you please run names_and_entities.pig with
> this file and send the output file to me? That would help me get a
> better understanding of the indexing pipeline.
>
>
> Best regards,
> Zhiwei
>
>
> On Mon, Apr 22, 2013 at 8:26 PM, Max Jakob <[email protected]> wrote:
>
>> Hi Zhiwei,  [CCing dbp-spotlight-developers mailing list]
>>
>> On Mon, Apr 22, 2013 at 1:30 PM, Cai Zhiwei <[email protected]>
>> wrote:
>> > I tried to index the English data set with pignlproc but got stuck
>> > on this step for a whole day. I used a very small dump file and
>> > tried every method mentioned at [1], but the problem still couldn't
>> > be solved.
>> >
>> > [1] https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/165
>>
>> Yes, this is still an important open issue. One version of the script
>> does a lot in memory, and sometimes RAM is not enough. The other
>> version spills a lot to disk, where available storage becomes the
>> limit. This happens with the English Wikipedia, so it will for sure
>> happen with the wiki-links dataset.
>>
>> I added one comment from Pablo to the issue. Do you have experience with
>> Pig?
>> @Chris, did you continue to try to solve this issue? Maybe you can
>> give Zhiwei some pointers?
>>
>> Scalability issues pop up frequently when dealing with this amount of
>> data. Solving them is probably part of the actual GSoC coding period
>> (though I won't stop you from solving them earlier, like a good open
>> source contributor ;) ).
>> In order to get familiar with the indexing pipeline and the code
>> (warm-up phase), it might be best to build a model from a subset of
>> the English Wikipedia, either limiting the input at the beginning of
>> the Pig Latin script (see the sketch below) or truncating the
>> Wikipedia dump XML.
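>>
>> For the first option, a rough, untested sketch; the aliases, the
>> loader, and the $INPUT parameter are placeholders, not the names the
>> script really uses:
>>
>>     -- right after whatever LOAD the script does, cap the input:
>>     pages = LOAD '$INPUT' USING TextLoader() AS (line:chararray);
>>     pages_small = LIMIT pages 10000;  -- small prefix for testing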
>>
>> Cheers,
>> Max
>>
>
>

Attachment: part-m-00001.bz2
Description: BZip2 compressed data
