Your comment in JIRA reminded me of another issue I have encountered while working with Droids. When a web server serves a default index file, e.g. http://www.apache.org/index.html and http://www.apache.org/
these two links refer to the same document. Ideally, the crawler framework should have some way to manage this, though I don't have a clear suggestion in mind. If I recall correctly, one of the other crawlers actually made fake requests to test certain server behaviour, e.g. whether a 404 error comes back as a normal 200 page with an error message. Maybe there could be a single configuration option (disabled by default): if enabled, when a "/index.*" URI is encountered, the crawler makes a request to "/" to test whether the content is more or less the same. If a host is detected as serving a default index, URIs with the matched index file are treated as identical to "/". This is one approach we could consider; right now we have to code the handling logic in our own handler.

regards,
mingfai

On Thu, Apr 2, 2009 at 8:48 PM, Mingfai <[email protected]> wrote:

> thx, I looked at DROIDS-8 and DROIDS-11, and it seems it is better to
> create a new issue to keep track of resolve-failure cases:
> https://issues.apache.org/jira/browse/DROIDS-45
>
> Tika/NekoHTML is responsible for parsing and provides the SAX-event-based
> parser. The actual URI-resolving logic is implemented in Droids as of now.
> In fact, in the second-to-last comment on DROIDS-8, you said the
> LinkExtractor doesn't depend on Tika and should be moved to core.
>
> I wonder if you agree that, ideally, the default Droids LinkExtractor
> should behave the same as a Mozilla browser for basic HTML links? (For
> sure it can't handle JavaScript links.)
>
> re. the JDK bug, I'm using JDK 1.6.0_12 and got the problem. Maybe no one
> has ever reported the issue, or they just don't care. My point is, it
> seems we have to work around the bug rather than hoping Sun Microsystems
> will fix it and everybody will upgrade to the latest revision.
> :-)
>
> Regards,
> mingfai
>
>
> On Thu, Apr 2, 2009 at 8:19 PM, Thorsten Scherler <
> [email protected]> wrote:
>
>> On Thu, 2009-04-02 at 19:38 +0800, Mingfai wrote:
>> > let's just look at the specific case first. Maybe I jumped too soon to
>> > the conclusion that the link extraction feature is too simple.
>> >
>> > At line 139 of LinkExtractor.java, it uses URI.resolve(String) to
>> > resolve a URI:
>> >
>> >     if (!target.toLowerCase().startsWith("javascript")
>> >         && !target.contains(":/")) {
>> >       return base.getURI().resolve(target.split("#")[0]);  // line 139
>> >     } else if (!target.toLowerCase().startsWith("javascript")) {
>> >       return new URI(target.split("#")[0]);
>> >     }
>> >
>> > When I test the URI API with:
>> >
>> >     new URI("http://www.google.com").resolve("index.php")
>> >
>> > it resolves the URL to "http://www.google.comindex.php".
>> >
>> > If you didn't mean it is a bug in my JDK, then we need to specially
>> > prepend a "/".
>>
>> Hmm,
>>
>> http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html#resolve(java.net.URI)
>> "...
>> 3. Otherwise the new URI's authority component is copied from this URI,
>>    and its path is computed as follows:
>>
>>    A. If the given URI's path is absolute then the new URI's path is
>>       taken from the given URI.
>>
>>    B. Otherwise the given URI's path is relative, and so the new URI's
>>       path is computed by resolving the path of the given URI against
>>       the path of this URI. This is done by concatenating all but the
>>       last segment of this URI's path, if any, with the given URI's
>>       path and then normalizing the result as if by invoking the
>>       normalize method.
>> ..."
>>
>> That sounds like new URI("http://www.apache.org").resolve("index.html")
>> should return http://www.apache.org/index.html.
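[Editor's note: a quick demo of the resolve() behaviour discussed in the quoted exchange, together with the "/"-prefix workaround Mingfai suggests. This is only a sketch on a JDK exhibiting the quirk; the class and method names are mine, not Droids code.]

```java
import java.net.URI;

public class ResolveDemo {

    /** Resolve ref against base, first ensuring the base has at least a "/" path. */
    public static URI resolveSafely(URI base, String ref) {
        String path = base.getPath();
        if (path == null || path.isEmpty()) {
            // "http://host" -> "http://host/", so relative refs merge correctly
            base = base.resolve("/");
        }
        return base.resolve(ref);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.google.com");
        // A bare resolve() on a path-less base concatenates host and path:
        System.out.println(base.resolve("index.php"));        // http://www.google.comindex.php
        System.out.println(resolveSafely(base, "index.php")); // http://www.google.com/index.php
    }
}
```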
>> Since it reads "the result as if by invoking the normalize method":
>>
>> http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html#normalize()
>>
>> >
>> > And previously I found another scenario that doesn't work: when there
>> > is a link <a href="?test=true">test</a> under www.google.com/index.php,
>> > it resolves to www.google.com/?test=true rather than
>> > www.google.com/index.php?test=true as in a web browser.
>> >
>> > This makes me feel there are many special scenarios that a crawler
>> > needs to cater for. What do you think? Is it really so simple? My
>> > suggestion to add a page is for listing those special scenarios, which
>> > sometimes may just be caused by non-standard usage.
>>
>> Actually, that should normally be handled by the methods linked above.
>> Please comment on DROIDS-8/DROIDS-11 if you find that the link
>> extraction is not working as expected.
>>
>> salu2
>>
>> >
>> > regards,
>> > mingfai
>> >
>> >
>> > On Thu, Apr 2, 2009 at 7:24 PM, Thorsten Scherler <
>> > [email protected]> wrote:
>> >
>> > > On Thu, 2009-04-02 at 18:53 +0800, Mingfai wrote:
>> > > > hi,
>> > > >
>> > > > The default LinkExtractor seems to be quite simple (too simple). It
>> > > > mainly uses URI.resolve and only caters for the # and javascript
>> > > > scenarios (see LinkExtractor.java, getURI). Even simple link
>> > > > resolving of <a href="test.html"> against
>> > > > new URI("http://www.google.com") will be wrong, as it will return
>> > > > "http://www.google.comtest.html".
>> > >
>> > > Well, the link extraction always worked well.
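[Editor's note: the "?test=true" scenario quoted above comes from java.net.URI following RFC 2396, while browsers follow RFC 3986, under which a reference consisting only of a query string keeps the base path. A small helper covering just that case — the class and method names are mine, a sketch, not the Droids implementation:]

```java
import java.net.URI;
import java.net.URISyntaxException;

public class BrowserLikeResolver {

    /**
     * Resolve like a browser (RFC 3986): a reference that is only a query
     * string, e.g. "?test=true", keeps the base path instead of replacing it.
     */
    public static URI resolve(URI base, String ref) throws URISyntaxException {
        if (ref.startsWith("?")) {
            // Keep scheme, authority, and path from the base; take the query from the ref.
            return new URI(base.getScheme(), base.getAuthority(),
                           base.getPath(), ref.substring(1), null);
        }
        return base.resolve(ref);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI base = new URI("http://www.google.com/index.php");
        System.out.println(base.resolve("?test=true"));   // http://www.google.com/?test=true
        System.out.println(resolve(base, "?test=true"));  // http://www.google.com/index.php?test=true
    }
}
```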
The case you just pointed
>> > > out looks like a bug, BUT if you mean
>> > > new URL("http://testServer.com", "test.html") then have a look at
>> > > http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html#URL(java.net.URL,
>> > > java.lang.String)
>> > >
>> > > > And there are many cases that URI.resolve doesn't cater for. It
>> > > > seems to me we need to do some work in this area to make Droids
>> > > > more usable. Does anyone have any experience in outlink extraction?
>> > >
>> > > Enhancements are always welcome, however the link extraction should
>> > > work fine. At least when I last looked at it, it was fine. The
>> > > limitation ATM is the extraction of jscript-generated links.
>> > >
>> > > > I'm trying to see how other frameworks handle outlink extraction
>> > > > and looked at:
>> > > > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java?view=log
>> > >
>> > > Funny enough, that has been the base of Droids' outlink extraction in
>> > > the first version I hacked.
>> > >
>> > > > https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix2/engine/src/main/java/org/archive/extractor/RegexpHTMLLinkExtractor.java
>> > > > (Heritrix's JavaDoc shows they have given some good thought to
>> > > > handling different tags and attributes)
>> > > >
>> > > > What do you think if I add a wiki page that lists some scenarios of
>> > > > outlink handling (i.e. the requirements)? Or does anyone know if
>> > > > any of the many Java crawler projects has documentation in this
>> > > > area?
>> > >
>> > > If you do not look into jscript/ajax link extraction then there is no
>> > > secret to it. Either go with XPath expressions or, e.g.
for plain text,
>> > > with regexps. Please feel free to open a wiki page around the issue.
>> > >
>> > > salu2
>> > >
>> > > >
>> > > > regards,
>> > > > mingfai
>> > >
>> > > --
>> > > Thorsten Scherler <thorsten.at.apache.org>
>> > > Open Source Java <consulting, training and solutions>
>> > >
>> > > Sociedad Andaluza para el Desarrollo de la Sociedad
>> > > de la Información, S.A.U. (SADESI)
>>
>> --
>> Thorsten Scherler <thorsten.at.apache.org>
>> Open Source Java <consulting, training and solutions>
>>
>> Sociedad Andaluza para el Desarrollo de la Sociedad
>> de la Información, S.A.U. (SADESI)
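[Editor's note: a rough sketch of the default-index detection suggested at the top of this mail. Everything here is hypothetical: PageFetcher and fetchBody() stand in for whatever HTTP client the crawler actually uses, and the "more or less the same" test is reduced to plain equality.]

```java
import java.util.regex.Pattern;

/**
 * Sketch: detect whether a host serves the same document for
 * "http://host/index.*" and "http://host/", so the two URIs can be
 * treated as identical during crawling.
 */
public class DefaultIndexDetector {

    // Matches a trailing "/index.<ext>" path segment, e.g. "/index.html".
    private static final Pattern INDEX_FILE =
            Pattern.compile("/index\\.[a-z]+$", Pattern.CASE_INSENSITIVE);

    private final PageFetcher fetcher;  // hypothetical fetch abstraction

    public DefaultIndexDetector(PageFetcher fetcher) {
        this.fetcher = fetcher;
    }

    /** True if the index URL serves the same body as the host root. */
    public boolean usesDefaultIndex(String indexUrl) {
        if (!INDEX_FILE.matcher(indexUrl).find()) {
            return false;  // not an "/index.*" URI at all
        }
        // "http://host/index.html" -> "http://host/"
        String rootUrl = INDEX_FILE.matcher(indexUrl).replaceFirst("/");
        String indexBody = fetcher.fetchBody(indexUrl);
        String rootBody = fetcher.fetchBody(rootUrl);
        return indexBody != null && indexBody.equals(rootBody);
    }

    /** Hypothetical minimal fetch interface (real code would use the crawler's HTTP client). */
    public interface PageFetcher {
        String fetchBody(String url);
    }
}
```

A real implementation would replace the equality check with a fuzzier similarity test and cache the per-host result, so the extra "/" request is made only once per host.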
