#299: Bibupload: ignore elements with no child text node
---------------------+------------------------------------------------------
 Reporter:  Maddog   |        Type:  enhancement
   Status:  new      |    Priority:  minor      
Milestone:           |   Component:  BibUpload  
  Version:  v0.99.1  |    Keywords:             
---------------------+------------------------------------------------------
 Hi,
 in real-life harvesting I often run into one problem: OAI repositories are
 usually messy, and it is very difficult to avoid ending up with some empty
 elements in the converted MARCXML file for upload (without putting
 conditions all over the transformation stylesheet). When you try to upload
 a MARCXML file with an empty datafield or subfield, bibupload crashes.
 I think it would be great to simply ignore all (max-depth) elements that
 contain no text nodes. To do this I edited the open_marc_file() function
 in bibupload.py and added some simple pre-processing:

 {{{

 ### Extra imports
 import xml.dom.minidom as dom
 from xml.parsers.expat import ExpatError

 def open_marc_file(path):
     """Open a file and return the data"""
     try:
         # open the file containing the marc document
         marc_file = open(path,'r')
         marc = marc_file.read()
         marc_file.close()
         ### My edit ###
         try:
             marcDom = dom.parseString(marc)
             # Remove subfields that have no child nodes at all
             subfields = marcDom.getElementsByTagName("subfield")
             for e in subfields:
                 if not e.hasChildNodes():
                     parent = e.parentNode
                     parent.removeChild(e)
                     parent.normalize()
                     # If a single text node (typically leftover
                     # whitespace) is all that remains, drop it too
                     if (len(parent.childNodes) == 1 and
                             isinstance(parent.childNodes[0], dom.Text)):
                         parent.removeChild(parent.childNodes[0])

             # Remove datafields that are now empty
             fields = marcDom.getElementsByTagName("datafield")
             for e in fields:
                 if not e.hasChildNodes():
                     parent = e.parentNode
                     parent.removeChild(e)
             marc = marcDom.toxml().encode('utf-8')

         except ExpatError:
             # Not well-formed XML: keep the data exactly as read
             pass
         ### End of my edit ###
     except IOError, erro:
         write_message("Error: %s" % erro, verbose=1, stream=sys.stderr)
         write_message("Exiting.", sys.stderr)
         task_update_status("ERROR")
         sys.exit(1)
     return marc
 }}}

 It has worked well for me so far, so if there is nothing wrong with this,
 maybe you could consider adding something like it to the system.
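
 The same idea can also be tried outside bibupload. The following is only a
 minimal sketch, not part of the patch above: the helper names has_text()
 and strip_empty_marc_elements() are hypothetical, and it assumes that the
 elements are written without a namespace prefix (i.e. as "subfield" and
 "datafield", not "marc:subfield"). Unlike the patch, it also treats
 elements holding only whitespace as empty.

```python
import xml.dom.minidom as dom


def has_text(node):
    """Return True if any descendant text node holds non-whitespace text."""
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            if child.data.strip():
                return True
        elif has_text(child):
            return True
    return False


def strip_empty_marc_elements(marc_xml):
    """Drop subfields, then datafields, that carry no text.

    Hypothetical helper: assumes unprefixed MARCXML element names.
    Raises xml.parsers.expat.ExpatError if the input is not well-formed.
    """
    doc = dom.parseString(marc_xml)
    # Subfields go first, so a datafield that loses all of its
    # subfields is itself seen as empty in the second pass.
    for tag in ("subfield", "datafield"):
        for element in doc.getElementsByTagName(tag):
            if not has_text(element):
                element.parentNode.removeChild(element)
    return doc.toxml()
```

 Calling strip_empty_marc_elements() on the MARCXML text before handing it
 to the upload step should leave only datafields that have at least one
 subfield with real content; controlfields and the leader are not touched.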

-- 
Ticket URL: <http://invenio-software.org/ticket/299>
Invenio <http://invenio-software.org>
