once you get the HttpURLConnection InputStream you might consider running it
through a library like JTidy ( http://sourceforge.net/projects/jtidy ) to
convert it to a DOM tree you can use.
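
a minimal sketch of the first step only (getting the raw HTML as text), using
nothing but the JDK.  the class name PageFetcher, the method names and the
UTF-8 charset are my assumptions for illustration; handing the result to JTidy
is not shown here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches a page over HTTP and returns the body as text.
final class PageFetcher {

    // Drains the stream into a String. Assumes the page is UTF-8 encoded;
    // a real client would look at the Content-Type header instead.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Opens the connection, reads the whole response body, then disconnects.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

readAll is kept separate from fetch so the same text can come from a saved file
while you experiment with the parsing side.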

in the case of a project i play with from time to time, gathering all the
dilbert images onto one page, i found that JTidy couldn't parse the HTML into
XML (HTML can be so badly formatted that only a very forgiving browser can
present it), so i switched to using regular expressions to extract the
specific lines i was looking for, and then reg-exp capture groups to pick out
the URLs i'm interested in.  this also worked very well when writing a
link-validator engine to find bad links on web pages.
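
a rough sketch of the capture-group idea with java.util.regex.  the pattern,
the class name LinkExtractor and the sample markup are my own assumptions, not
the code from either project; it only handles quoted href/src attributes and
will miss unquoted ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls URL-valued attributes out of raw HTML text.
final class LinkExtractor {

    // Matches href="..." or src='...' in any letter case.
    // Capture group 1 is the attribute value between the quotes.
    private static final Pattern URL_ATTR = Pattern.compile(
            "(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns every captured URL in the order it appears in the input.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_ATTR.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a HREF='http://example.com/a'>x</a>"
                + "<img src=\"/img/strip.gif\">";
        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

for a link validator you would then resolve each captured value against the
page's own URL and request it to see whether it answers.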

the problem very much depends on what the HTML you're getting back looks
like, and then on what you want to extract from it (URLs, particular strings,
etc).  you always have to watch for the HTML format being changed over time,
as it may break your parser, and hope that the server doesn't use JavaScript
or some other scripting language to modify the DOM at display time.  also, if
you can convert the HTML into XML you can apply XSLT to extract any parts you
want quite easily.
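
to illustrate the XSLT route, here is a small sketch using the JDK's built-in
javax.xml.transform.  the stylesheet, the class name XsltLinkDump and the
sample input are assumptions of mine; it expects input that is already
well-formed XML with no namespace on the elements (for namespaced XHTML the
//a path would need a prefix bound to the XHTML namespace).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: lists the href of every <a> element, one per line.
final class XsltLinkDump {

    // Text-output stylesheet: visits each a/@href and emits value + newline.
    private static final String STYLESHEET =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='/'>"
          + "<xsl:for-each select='//a/@href'>"
          + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
          + "</xsl:for-each>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Applies the stylesheet to the given well-formed XML document.
    static String hrefs(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<html><body><a href='a.html'>A</a>"
                + "<p><a href='b.html'>B</a></p></body></html>";
        System.out.print(hrefs(xml));
    }
}
```

the same stylesheet approach extends to other parts of the page by changing the
select expression, which is what makes it attractive once the HTML has been
cleaned up into XML.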

contact me directly if you have specific questions - i enjoy web-page scraping 
projects. :)

...........ron.

> Unfortunately, I don't have much control over the
> server.  (It is owned by a different organization.)
>
> I am doing this for part of a school project... if I
> couldn't use Axis... what Java library or otherwise
> would you suggest for handling the HTML response?
> Basically, I'm only interested in getting a client that
> receives the correct output, which I can eventually parse.
>
> JR
>
> --- Ron Reynolds <[EMAIL PROTECTED]> wrote:
>
>> depends on how much control you have over the server.  to support both
>> html and soap responses i would add a separate handler/url on the server
>> to return a soap response instead of html.  or if you want to use the same
>> url you could add a url parameter to dictate a soap response (and the lack
>> of the parameter would indicate html so as to not impact current html
>> clients).  and if you have no control over the server you could create
>> your own custom client which would have to parse the html (perhaps using
>> jtidy to convert it to xml for easier parsing).  i don't think a custom
>> serializer on the client would work because i think there's a
>> soap-envelope-parsing layer between receiving the response and invoking
>> the deserializers, but i haven't delved into this piece of axis yet, tho
>> today i'm actually starting on creating custom serializer/deserializers so
>> if it turns out you can handle the response as pure HTML i'll let you
>> know, tho i'm pretty sure at the very least the (de)serializers get a DOM
>> Document object which means the response has already gone through an XML
>> parser so if it's not well-formed html (i.e., xhtml) it won't get past
>> that parser.
>>
>> hth.
>> ..............ron.
>> > Ahah... I think the server returns an HTML document with the results
>> > instead of a traditional response.  I would like to use AXIS to get this
>> > document.  (The service I am connecting to suggests NuSOAP for PHP.)
>> > Is there any way to do this?
>> >
>> > Thanks for everyone's help!
>> >
>> > JR
>> >
>> > --- Ron Reynolds <[EMAIL PROTECTED]> wrote:
>> >
>> >> i've seen this when i got redirected by an auth filter and the auth
>> >> server sent an HTML form (assuming the client was a browser) which, of
>> >> course, doesn't parse as soap very well.  it may be your server is
>> >> trying to return some form of HTML error page or perhaps a generated
>> >> directory listing (if the URL maps to a directory).  relay through
>> >> tcpmon or get the client-side to dump the response to log to at least
>> >> see what the client is getting.
>> >>
>> >> hth.
>> >> .................ron.
>> >>
>> >> > I have been working on using Apache Axis 1.2 to
>> >> > interact with a Web Service that is using nuSOAP.  I
>> >> > get the following error:
>> >> >
>> >> > - Exception in AddressingHandler
>> >> > AxisFault
>> >> >  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
>> >> >  faultSubcode:
>> >> >  faultString: org.xml.sax.SAXException: Bad envelope tag:  html
>> >> >  faultActor:
>> >> >  faultNode:
>> >> >  faultDetail:
>> >> >         {http://xml.apache.org/axis/}stackTrace:org.xml.sax.SAXException: Bad envelope tag:  html
>> >> >         at org.apache.axis.message.EnvelopeBuilder.startElement(EnvelopeBuilder.java:70)
>> >> >         at org.apache.axis.encoding.DeserializationContext.startElement(DeserializationContext.java:1048)
>> >> >         at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
>> >> >
>> >> > What are the possible causes of this?
>> >> >
>> >> > Thanks for the help,
>> >> > JR
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > __________________________________
>> >> > Yahoo! Mail - PC Magazine Editors' Choice 2005
>> >> > http://mail.yahoo.com
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>
>
>
>
>

