I attached a commit patch (apply with `git am ...`) to the 'books.pharo.org' repo to update the Scraping .pdf link. (The .pdf it links to now is obsolete.)
> Sent: Friday, January 26, 2018 at 2:30 PM
> From: "Stephane Ducasse" <[email protected]>
> To: "Pharo Development List" <[email protected]>
> Subject: Re: [Pharo-dev] How to get rid of empty XML nodes?
>
> Tx Monty!
> This is a really important addition :)
> Because a super frequent scenario.
>
> Stef
>
> On Fri, Jan 26, 2018 at 8:37 AM, monty <[email protected]> wrote:
> > See #removeAllFormattingNodes and its comment in the latest version.
> >
> > And instances of SAXHandler and subclasses are meant to be created with
> > #on: (or another "instance creation" message), _not #new_, otherwise they
> > won't be properly initialized. The class comment is clear about this, but I
> > should have overridden #new to raise an error like Stream does. Your misuse
> > was helpful in bringing this to my attention, and I added a Stream-like
> > #new implementation to SAXHandler.
> >
> >> Sent: Friday, December 08, 2017 at 9:21 AM
> >> From: "Stephane Ducasse" <[email protected]>
> >> To: "Pharo Development List" <[email protected]>
> >> Subject: Re: [Pharo-dev] How to get rid of empty XML nodes?
> >>
> >> Hi monty
> >>
> >>
> >> On Fri, Dec 8, 2017 at 9:03 AM, monty <[email protected]> wrote:
> >> > By "empty XML nodes," do you mean whitespace-only string nodes?
> >>
> >> Yes
> >>
> >> > Those are included because all in-element whitespace is assumed
> >> > significant by the spec: https://www.w3.org/TR/xml/#sec-white-space
> >>
> >> I know. There was a discussion a while ago. I just lost a couple of
> >> hours understanding that :(
> >>
> >> But this is a super super super annoying practices.
> >> We had to test each nodes to see if it is a empty nodes so it makes
> >> everything a lot more complex without real justification
> >> beside the fact that these standardizers probably never implemented
> >> some real cases.
> >> This standard is a really out of reality from that perspective.
> >>
> >> > The exception is if the element is declared in the DTD as only having
> >> > element children ("element content"):
> >> > https://www.w3.org/TR/xml/#dt-elemcontent
> >>
> >> Well the XML files that I had (I did not choose XML because I would
> >> have prefer JSON :) ), had no DTD :(
> >>
> >> So at the end of the day, this wonderful standard puts all the stress
> >> and burden to people.
> >>
> >> >
> >> > For example, if you declare an element like this:
> >> >
> >> > <!ELEMENT one (two,three*,four?)>
> >> >
> >> > Any whitespace around a "two," "three," or "four" element child of a
> >> > "one" element is insignificant and ignored (unless
> >> > #preservesIgnorableWhitespace: is true). Other parsers, like LibXML2 and
> >> > Xerces, behave the same way.
> >> >
> >> > I'll see if I can come up with some easier way to deal with this, like
> >> > an optional parser setting, new enumeration methods, or maybe a tree
> >> > transformation.
> >>
> >> It would be A HUGE PLUS!!!!!!!!!!!!!!!!!!
> >>
> >>
> >> Because reality is that people have XML files with just nodes and no
> >> empty nodes and they are forced to
> >> Let me know because I could try.
> >>
> >> I was showing how to use Pharo to import code to pharo learners and
> >> this was a big drag.
> >>
> >> Stef
> >>
> >>
> >> I tried to set some values in the parser but it did not work.
> >> BTW I saw that the configuration logic forces to write the following
> >>
> >> | parser doc visitor |
> >> parser := XMLDOMParser new
> >>     on: self xmlContents;
> >>     preservesIgnorableWhitespace: true.
> >>
> >> and not
> >>
> >> | parser doc visitor |
> >> parser := XMLDOMParser new
> >>     preservesIgnorableWhitespace: true.
> >>     on: self xmlContents;
> >>
> >>
> >> >
> >> >> Sent: Tuesday, December 05, 2017 at 8:29 AM
> >> >> From: "Stephane Ducasse" <[email protected]>
> >> >> To: "Pharo Development List" <[email protected]>
> >> >> Subject: [Pharo-dev] How to get rid of empty XML nodes?
> >> >>
> >> >> Hi
> >> >>
> >> >> we are manipulating an XML document and I would like to get rid of the
> >> >> spurious empty string.
> >> >> We saw that the gt panes are doing it.
> >> >>
> >> >> (aNodeWithElements isStringNode
> >> >>     and: [aNodeWithElements isEmpty
> >> >>         or: [aNodeWithElements isWhitespace]]
> >> >>
> >> >> Is there a way not to produce empty nodes?
> >> >> Is there a simple way not to have to handle them
> >> >>
> >> >> Now each time we are dealing with a node with have to check.
> >> >>
> >> >> Stef
> >> >>
> >> >>
> >> >
> >>
> >>
> >
> >
From 32a11bd48135e23907ca3efce95d8e8347c65c04 Mon Sep 17 00:00:00 2001
From: monty <[email protected]>
Date: Mon, 29 Jan 2018 07:48:42 -0500
Subject: [PATCH] updated Scraping booklet .pdf link

---
 index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/index.html b/index.html
index 5aa9aba..e5729ff 100644
--- a/index.html
+++ b/index.html
@@ -76,7 +76,7 @@
  <div class="container">
  <div class="row">
  <div class="col-md-4">
- <a href="https://files.pharo.org/books-pdfs/booklet-Scraping/2017-09-29-scrapingbook.pdf" title="Scraping with XPath"> <img src="booklet-Scraping/bklet-scraping.png" alt="XPath HTML Scraping booklet" class="img-thumbnail" style="height:400px"></a>
+ <a href="https://bintray.com/squarebracketassociates/wip/download_file?file_path=scrapingbook-wip.pdf" title="Scraping with XPath"> <img src="booklet-Scraping/bklet-scraping.png" alt="XPath HTML Scraping booklet" class="img-thumbnail" style="height:400px"></a>
  <p><em>XPath </em> is a powerful technology. In this tutorial we show how we use it to scrap information from HTML page. Booklet written by S. Ducasse and P. Kenny. You can find the latest version <a href="https://github.com/SquareBracketAssociates/Booklet-Scraping">Here</a>.</p>
  </div>
-- 
2.11.0
