Hi Chris,
I tried the second point as you suggested and the problem was solved. It
seems that the line "set io.sort.mb 1024" brought the heap space error back
even when I tried to set a larger heap. I will add a note to the document
so that others can avoid the same mistake. Thanks a lot.
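
For anyone who hits the same error, the relevant lines at the top of
names_and_entities.pig now look like this (commenting out io.sort.mb alone
was enough for me; the '-Xmx2048m' heap value from Chris's first point may
still need tuning for your machine):

    SET mapred.child.java.opts '-Xmx2048m'
    -- set io.sort.mb 1024
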
Best regards,
Zhiwei
On Tue, Apr 23, 2013 at 6:49 AM, Chris Hokamp <[email protected]> wrote:
> Hi Zhiwei, Max,
>
> I just ran names_and_entities.pig locally on a dump with 1000 articles
> and got the same error you had. I fixed the error by doing two things:
>
> 1) added this line to the script: SET mapred.child.java.opts '-Xmx2048m'
> 2) _commented out_ this line: --set io.sort.mb 1024
>
> I think the second point is what fixed it. My guess is that this parameter
> causes the heap space error when you're running locally. Another note:
> you'll probably want to maintain separate .params files for local tests
> and for runs on a real cluster (a sketch follows below). I'm attaching
> part of the output from names_and_entities.pig. Zhiwei, please shoot me
> an email if you have any more issues like this. Jo and I have wrestled
> with quite a few of them :-), and there's obviously still a lot to fix.
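>
> As an illustration, a params file for local tests could look something
> like this (the parameter names and paths are made up for the example, so
> substitute whatever the script actually expects):
>
>   # local.params
>   INPUT=/tmp/test-dump.xml
>   OUTPUT=/tmp/pignlproc-output
>
> You would then run in local mode with:
>
>   pig -x local -param_file local.params names_and_entities.pig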
>
> Cheers,
> Chris
>
>
>
> On Mon, Apr 22, 2013 at 3:08 PM, Cai Zhiwei <[email protected]> wrote:
>
>> Hi Max,
>>
>> My test Wikipedia dump file contains only two pages! My computer has 5 GB
>> of memory and it ran out in a few seconds. Is that reasonable? I only want
>> to see the output file and learn more about the details. I also tried
>> the names_and_entities_low_memory.pig version, but it still failed. My
>> test dump file is attached.
>> Pig version: 0.11.1
>> Operating system: Ubuntu 12.04
>>
>> Max, if you have time, could you please run names_and_entities.pig with
>> this file and send the output file to me? That would help me get a better
>> understanding of the indexing pipeline.
>>
>>
>> Best regards,
>> Zhiwei
>>
>>
>> On Mon, Apr 22, 2013 at 8:26 PM, Max Jakob <[email protected]> wrote:
>>
>>> Hi Zhiwei, [CCing dbp-spotlight-developers mailing list]
>>>
>>> On Mon, Apr 22, 2013 at 1:30 PM, Cai Zhiwei <[email protected]>
>>> wrote:
>>> > I tried to index the English data set with pignlproc but got stuck on
>>> > this step for a whole day. I used a very small dump file and tried
>>> > every method mentioned at [1], but the problem still couldn't be solved.
>>> >
>>> > [1]https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/165
>>>
>>> Yes, this is still an important open issue. One version of the script
>>> does a lot of work in memory, and sometimes RAM is not enough. The other
>>> version writes a lot to disk, where available storage becomes the
>>> limit. This happens with the English Wikipedia, so it will certainly
>>> happen with the wiki-links dataset.
>>>
>>> I added one comment from Pablo to the issue. Do you have experience with
>>> Pig?
>>> @Chris, did you continue to try to solve this issue? Maybe you can
>>> give Zhiwei some pointers?
>>>
>>> Scalability issues pop up frequently when dealing with this amount of
>>> data. Solving them is probably part of the actual GSoC coding period
>>> (though I won't stop you from solving it earlier as a good open
>>> source contributor ;) ).
>>> To get familiar with the indexing pipeline and the code (the warm-up
>>> phase), it might be best to build a model from a subset of the English
>>> Wikipedia, either by limiting the input at the beginning of the Pig
>>> Latin script or by truncating the Wikipedia dump XML.
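>>>
>>> For example, assuming the alias holding the loaded articles is called
>>> 'articles' (both names below are illustrative), a LIMIT right after
>>> the LOAD keeps the rest of the script unchanged:
>>>
>>>   -- 'articles' is a hypothetical alias for the loaded pages;
>>>   -- downstream statements would then read from articles_small
>>>   articles_small = LIMIT articles 1000;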
>>>
>>> Cheers,
>>> Max
>>>
>>
>>
>