Hi,

I need to read XML files from our suppliers, who generally use the ONIX
spec when generating their files. These files contain book metadata and,
at a high level, have a fairly simple makeup: one header section and one
or more product sections.

My approach so far has been to use Reader to avoid loading entire
files into memory, and for the most part this works fine.

The only place it falls down is when a file contains an entity that
isn't &amp;, &lt;, &gt; or a numeric one.

Here's a sample file that uses the &ndash; entity:
http://gist.github.com/79386

And here's a contrived example that uses Reader to extract the Header
and Product records: http://gist.github.com/79387. If you run this, it
outputs the following nonfatal error and doesn't return the full
text of the Product node:

~/git/onix.git master$ ruby examples/entities.rb 
<Header>
  <FromCompany>HarperCollins Publishers</FromCompany>
  <ToCompany>Australian Booksellers Association</ToCompany>
  <SentDate>20081106</SentDate>
</Header>

Error: Entity 'ndash' not defined at examples/../data/entities.xml:28.

--

I have 2 questions:

- The ONIX DTD defines a range of entities, including &ndash;. Can I
  get libxml/Reader to recognise them?
- Failing that, can I get Reader to just return entities unmodified
  instead of stopping with an error? I've tried passing various options
  to the Reader constructor (like XML::Parser::Options::NOENT) to no
  avail.
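
If neither is possible, the fallback I can picture is rewriting named
entities as numeric character references before the text reaches Reader,
since numeric ones already parse fine. This is only a rough sketch under
assumptions: the NAMED_ENTITIES table and the numeric_entities helper are
hypothetical names made up for illustration, the table holds just two
entries rather than the full set from the ONIX DTD, and it is plain Ruby
that does not touch the libxml API at all.

```ruby
# Hypothetical lookup table: entity name => Unicode code point.
# Only two entries here; the real set would come from the ONIX DTD.
NAMED_ENTITIES = {
  "ndash" => 0x2013, # EN DASH
  "mdash" => 0x2014  # EM DASH
}.freeze

# Entities predefined by XML itself; these must be left untouched.
XML_PREDEFINED = %w[amp lt gt quot apos].freeze

# Replace known named entities with numeric character references.
# Unknown names and the XML predefined entities pass through unchanged.
def numeric_entities(text)
  text.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    name = Regexp.last_match(1)
    if XML_PREDEFINED.include?(name) || !NAMED_ENTITIES.key?(name)
      match
    else
      format("&#%d;", NAMED_ENTITIES[name])
    end
  end
end
```

Applied line by line while copying a supplier file to a temporary file,
this would keep memory use flat, but it is an extra pass over the data
that I'd rather avoid if Reader can be told about the entities directly.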

Cheers
  
-- James Healy <jimmy-at-deefa-dot-com>  Sun, 15 Mar 2009 22:18:44 +1100
_______________________________________________
libxml-devel mailing list
libxml-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/libxml-devel
