> Do we still need the DMOZ parser?
DMOZ has been offline for 3 years now [1], and none of the projects claiming to
be successors [2,3] provides the RDF dumps required as input for the DMOZ parser.
It will soon become a legacy tool, and we might consider whether it's better to
remove it.
I remember that 4 years ago I used DMOZ to seed a crawl of news sites from
all over the world.
The coverage of DMOZ was definitely good at that time. But it's clear that it
will degrade, and it's not easy to find a copy of the dumps.
Sebastian
[1] https://en.wikipedia.org/wiki/DMOZ
[2] https://curlie.org/docs/en/rdf.html
[3] http://dmoztools.net/docs/en/rdf.html
On 1/25/21 12:04 PM, BlackIce wrote:
Do we still need the DMOZ parser?
On Sun, Jan 24, 2021 at 10:38 PM lewis john mcgibbney
<[email protected]> wrote:
Description:
An XML external entity (XXE) injection vulnerability was discovered in the Nutch
DmozParser and is known to affect Nutch versions < 1.18. XML external entity
injection (also known as XXE) is a web security vulnerability that allows an
attacker to interfere with an application's processing of XML data. It often
allows an attacker to view files on the application server filesystem, and to
interact with any back-end or external systems that the application itself can
access.
This issue is being tracked as NUTCH-2841.
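To illustrate the class of problem described above, here is a minimal sketch of how a SAX-based parser in Java is commonly hardened against XXE. It is not the actual NUTCH-2841 patch: the class name HardenedSaxExample and the sample documents are hypothetical, and it assumes the JDK's built-in (Xerces-derived) parser, which recognizes the feature URIs used below.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of Nutch.
public class HardenedSaxExample {

    /** Builds a SAX parser that rejects DOCTYPE declarations and external entities. */
    static SAXParser newHardenedParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Refuse any document that carries a DOCTYPE; this blocks entity declarations entirely.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Defense in depth: also switch off external general and parameter entities.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        return factory.newSAXParser();
    }

    /** Returns true if the document parses, false if the hardened parser rejects it. */
    static boolean parses(String xml) throws Exception {
        try {
            newHardenedParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, including the disallowed DOCTYPE.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String benign = "<RDF><Topic/></RDF>";
        // Classic XXE payload: an external entity pointing at a local file.
        String malicious = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE RDF [<!ENTITY xxe SYSTEM \"file:///etc/passwd\">]>"
            + "<RDF>&xxe;</RDF>";
        System.out.println("benign parses: " + parses(benign));
        System.out.println("malicious parses: " + parses(malicious));
    }
}
```

Rejecting DOCTYPE outright is the strictest option. If an input format legitimately needs a DTD, only the two external-entity features would be disabled instead.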
Credit:
The Apache Nutch Project Management Committee would like to thank Martin Heyden
for reporting this issue to the Apache Security Team. We are indebted.
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc