Hi Zhiwei,

[CCing dbp-spotlight-developers mailing list]

On Mon, Apr 22, 2013 at 1:30 PM, Cai Zhiwei <[email protected]> wrote:
> I tried to index the English data set with pignlproc but got stuck on this step
> for a whole day. I used a very small dump file and tried every method
> mentioned at [1], but the problem still couldn't be solved.
>
> [1] https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/165
Yes, this is still an important open issue. There is a version that does a lot in memory, and sometimes RAM is not enough. The other version dumps a lot to disk, where available storage becomes the limit. This happens with the English Wikipedia, so it will certainly happen with the wiki-links dataset as well. I added one comment from Pablo to the issue. Do you have experience with Pig?

@Chris, did you continue trying to solve this issue? Maybe you can give Zhiwei some pointers?

Scalability issues pop up frequently when dealing with this amount of data. Solving them is probably part of the actual GSoC coding period (though I won't stop you from tackling it beforehand, like a good open source contributor ;) ).

In order to get familiar with the indexing pipeline and the code (warm-up phase), it might be best if you build a model from a subset of the English Wikipedia, either by limiting the input at the beginning of the Pig Latin script or by truncating the Wikipedia dump XML.

Cheers,
Max

_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
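[Editor's note: the "truncate the Wikipedia dump XML" suggestion above can be sketched as follows. This is a hypothetical helper, not part of pignlproc; it assumes the usual MediaWiki dump layout where each `</page>` closing tag sits on its own line, and keeps only the first N pages so indexing runs on a small subset during the warm-up phase.]

```python
def truncate_dump(in_path, out_path, max_pages=1000):
    """Copy a MediaWiki dump XML, keeping only the first max_pages <page>
    elements, and close the root element so the output stays well-formed.

    Assumption: </page> appears on its own line, as in standard dumps.
    """
    pages = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
            if "</page>" in line:
                pages += 1
                if pages >= max_pages:
                    break
        else:
            # Dump had fewer than max_pages pages; the whole file,
            # including the root closing tag, was already copied.
            return
        # We stopped early, so restore the root closing tag ourselves.
        dst.write("</mediawiki>\n")
```

A streaming, line-based cut like this avoids loading the multi-gigabyte dump into memory; the resulting file can then be fed to the indexing pipeline as-is.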
