Hi Ferran! A quick reply on just the signal issue:
There is a patch available that you can apply directly to release 0.99.1. You can find it in my public branch here:

<http://cdsware.cern.ch/repo/?p=personal/cds-invenio-sam.git;a=commit;h=04d2cee71d151b7ce600011f4d2414ff28020419>

This should apply cleanly to your repository.

Best regards,
Samuele

________________________________________
From: Ferran Jorba [[email protected]]
Sent: Monday, 15 March 2010 12:55
To: project-cdsware-users (CDS Invenio users)
Subject: Recommend lynx instead of html2text, and signal issues when indexing fulltext

Hello Invenio developers,

Sorry for this long mail and for my inability to provide patches that fix my current issue, but I would like to raise it on the -users list because I think it may matter to other installations.

After finishing our 0.99.1 migration, we have started to index fulltext in our installation. To start with, and as stated on the -developers list, I had to patch bibindex_engine.py so that it accepts any 856 second indicator, changing _ to %, e.g.:

@@ -1455,7 +1458,7 @@ def task_run_core():
         # Let's work on single words!
         wordTables = get_word_tables(task_get_option("windex"))
         for index_id, index_tags in wordTables:
-            wordTable = WordTable(index_id, index_tags, 'idxWORD%02dF', get_words_from_phrase, {'8564_u': get_words_from_fulltext})
+            wordTable = WordTable(index_id, index_tags, 'idxWORD%02dF', get_words_from_phrase, {'8564%u': get_words_from_fulltext})
             _last_word_table = wordTable
             wordTable.report_on_table_consistency()

I know that it may not be accepted for merging upstream yet, and I still have to do a full check myself.

A second issue I'm having is that our site has a lot of HTML documents, and a bunch of them are in non-UTF-8 charsets (mostly iso-8859-1 and windows-1251). I have been watching and debugging it the whole morning. In a word, bibindex_engine expects everything in UTF-8, and when it is not, it complains loudly. Adding the exception to the message, I got:

2010-03-15 09:47:05 --> Error: Cannot put word num??riques with sign 1 for recID 10 (exception: 'utf8' codec can't decode bytes in position 9-11: invalid data).

How can I get clean UTF-8 text from any HTML document, in any charset? html2text has the -ascii option to output unaccented text, but it didn't do anything good with my files. Fortunately, lynx does it cleanly. This quick-and-dirty patch allows me to make some progress:

@@ -417,6 +417,8 @@ def get_words_from_fulltext(url_direct_or_indirect, stemming_language=None):
             elif os.path.basename(conv_program) == "html2text":
                 cmd = "%s %s > %s" % \
                       (conv_program, tmp_name, tmp_dst_name)
+                cmd = "lynx -dump -display_charset=utf8 %s >%s" % \
+                      (tmp_name, tmp_dst_name)
             else:
                 write_message("Error: Do not know how to handle %s conversion program." % conv_program, sys.stderr)
             # try to run it:

But my joy was short-lived: my bibindex task ended in error. Running it in verbose -v9 mode, I can see a lot of 'got signal 12 frame' messages like this one:

[...]
2010-03-15 11:54:08 --> ... data to elaborate: [('pdf', 'http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf')]
2010-03-15 11:54:08 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf started
2010-03-15 11:54:09 --> ..... launching /usr/bin/pdftotext -enc UTF-8 /tmp/tmpGyWn-Minvenio.tmp /home/ddd/invenio/var/tmp/tmpvz3OGFinvenio.tmp.txt
2010-03-15 11:54:09 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf ended
2010-03-15 11:54:09 --> ... reading fulltext files from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf ended
2010-03-15 11:54:09 --> ... reading fulltext files from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf started
2010-03-15 11:54:09 --> ... data to elaborate: [('pdf', 'http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf')]
2010-03-15 11:54:09 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf started
2010-03-15 11:54:10 --> task_sig_ping(), got signal 12 frame <frame object at 0x2b60d70>
2010-03-15 11:54:10 --> Updating task status to ERROR.
2010-03-15 11:54:10 --> Task #615 finished but not resubmitted. [ERROR]

I've been digging into this signal issue in the git repository, and I found the following patch from Samuele:

http://cdsware.cern.ch/repo/?p=cds-invenio.git;a=commitdiff;h=795252c39cdaaefd8649185373a8869064801d14

But after adjusting the file paths to match my running installation, it doesn't apply cleanly:

$ guilt push
Applying patch..dropped-signal-usage-in-bibsched-bibtasks.patch
error: patch failed: lib/python/invenio/bibsched.py:747
error: lib/python/invenio/bibsched.py: patch does not apply
error: patch failed: lib/python/invenio/bibtask.py:53
error: lib/python/invenio/bibtask.py: patch does not apply
To force apply this patch, use 'guilt push -f'

Summing up my help request: I found workarounds for everything except this signal issue, which is well above my skills. Do you have any suggestions to overcome it?

Thanks,
Ferran
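
On the charset problem described in the message above (bibindex_engine expecting UTF-8 and failing on iso-8859-1 or windows-1251 input), the general idea of normalizing text before indexing can be sketched as below. This is only an illustration under stated assumptions: the function name `to_utf8` and the `fallback_charset` parameter are hypothetical and not part of Invenio, the fallback charset has to be chosen by the caller because legacy 8-bit charsets cannot be told apart reliably from the bytes alone, and the sketch is written for Python 3 rather than the Python version that release 0.99.1 runs on.

```python
def to_utf8(raw, fallback_charset="iso-8859-1"):
    """Return the given bytes as valid UTF-8 bytes.

    Hypothetical helper, not Invenio code. Bytes that already decode as
    UTF-8 are returned unchanged; anything else is assumed to be in
    fallback_charset and is re-encoded.
    """
    try:
        # Strict decoding raises on any byte sequence that is not UTF-8.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is in the caller-chosen legacy charset.
        return raw.decode(fallback_charset).encode("utf-8")


if __name__ == "__main__":
    latin1_bytes = "num\u00e9riques".encode("iso-8859-1")
    print(to_utf8(latin1_bytes).decode("utf-8"))
```

The strict UTF-8 check comes first because valid UTF-8 is very unlikely to occur by accident in text written in a legacy 8-bit charset, whereas iso-8859-1 accepts any byte sequence and so only works as a last resort.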
