On Sat, Mar 19, 2011 at 2:13 PM, Gabriele Kahlout <[email protected]> wrote:
> Hello,
>
> I've downloaded this DBpedia file
> <http://downloads.dbpedia.org/3.6/en/wikipedia_links_en.nt.bz2> and written
> a simple parser to extract Wikipedia URLs from it, as shown below. I find
> the result unsatisfactory since it contains many duplicates. Adding logic
> to the parser to avoid them (by remembering URLs already seen) also seems
> very expensive, since the uncompressed file is 3GB. Is there a better
> approach to getting Wikipedia URLs, like what is done with dmoz in:
>
> wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
> bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
>
>
>
> http://en.wikipedia.org/wiki/AfghanistanGeography
> http://dbpedia.org/resource/AfghanistanGeography
> http://en.wikipedia.org/wiki/AfghanistanGeography
> n"@e
> http://dbpedia.org/resource/AfghanistanGeography
> http://en.wikipedia.org/wiki/AfghanistanGeography
> http://en.wikipedia.org/wiki/Anarchism
> http://dbpedia.org/resource/Anarchism
> http://en.wikipedia.org/wiki/Anarchism
> n"@e
> http://dbpedia.org/resource/Anarchism
> http://en.wikipedia.org/wiki/Anarchism
> http://en.wikipedia.org/wiki/AccessibleComputing
> http://dbpedia.org/resource/AccessibleComputing
> http://en.wikipedia.org/wiki/AccessibleComputing
> n"@e
> http://dbpedia.org/resource/AccessibleComputing
> http://en.wikipedia.org/wiki/AccessibleComputing
> http://en.wikipedia.org/wiki/AfghanistanHistory
> http://dbpedia.org/resource/AfghanistanHistory
> http://en.wikipedia.org/wiki/AfghanistanHistory
> n"@e
> http://dbpedia.org/resource/AfghanistanHistory
> http://en.wikipedia.org/wiki/AfghanistanHistory
> http://en.wikipedia.org/wiki/AfghanistanPeople
> http://dbpedia.org/resource/AfghanistanPeople
> http://en.wikipedia.org/wiki/AfghanistanPeople
> n"@e
> http://dbpedia.org/resource/AfghanistanPeople
> http://en.wikipedia.org/wiki/AfghanistanPeople
> http://en.wikipedia.org/wiki/AfghanistanTransportations
> http://dbpedia.org/resource/AfghanistanTransportations
> http://en.wikipedia.org/wiki/AfghanistanTransportations
> n"@e
> http://dbpedia.org/resource/AfghanistanTransportations
> http://en.wikipedia.org/wiki/AfghanistanTransportations
> http://en.wikipedia.org/wiki/AfghanistanCommunications
> http://dbpedia.org/resource/AfghanistanCommunications
> http://en.wikipedia.org/wiki/AfghanistanCommunications
>
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>
>
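
One possible way to avoid remembering every URL, offered only as a rough sketch: the sample output above suggests that the lines for a given page sit next to each other, so it may be enough to compare each Wikipedia URL with the previous one while streaming the compressed file. The function names `wikipedia_urls` and `urls_from_dump` and the regular expression are my own illustration, not part of any DBpedia or Nutch tool, and the adjacency of duplicates is an assumption I have not checked against the whole dump.

```python
import bz2
import re
from typing import Iterable, Iterator

# N-Triples writes IRIs inside angle brackets; capture only Wikipedia page URLs.
WIKI_URL = re.compile(r"<(http://en\.wikipedia\.org/wiki/[^>\s]+)>")


def wikipedia_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield Wikipedia URLs, dropping repeats of the URL seen just before.

    ASSUMPTION: duplicates of a URL are adjacent in the input, as the
    sample output in the question suggests. Memory use is then constant.
    """
    previous = None
    for line in lines:
        for url in WIKI_URL.findall(line):
            if url != previous:
                yield url
                previous = url


def urls_from_dump(path: str) -> Iterator[str]:
    """Stream a .nt.bz2 file without decompressing it to disk first."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from wikipedia_urls(handle)
```

Printing each value from `urls_from_dump("wikipedia_links_en.nt.bz2")` to a file would then give one URL per line. If some duplicates turn out not to be adjacent, piping that output through `sort -u` should remove the rest, since `sort` spills to temporary files rather than keeping everything in memory.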
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
