Mark, Thanks for your prompt reply.
> as the notes on the page you link to say, pdftotext
> package doesn't have a "-" sign so this may be an
> issue similar to this.

This is not related to the accented characters issue. FYI, the "-" sign is a problem when a word is split across two lines with a hyphen. This doesn't affect accented words more than others. pdftotext and iconv combined correctly convert a PDF document containing French characters to UTF-8. The issue is elsewhere.

Any other ideas?

--
[email protected]
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be

----- Original Message -----
From: "Mark (Markie)" <[email protected]>
To: "MediaWiki announcements and site admin list" <[email protected]>
Sent: Tuesday, February 24, 2009 2:35 PM
Subject: Re: [Mediawiki-l] Extension:FileIndexer has issue with accented characters

as the notes on the page you link to say, pdftotext package doesn't have a "-" sign so this may be an issue similar to this.

regards

mark

On Tue, Feb 24, 2009 at 1:25 PM, Francois Piette <[email protected]> wrote:

> Hi !
>
> I have installed the Extension:FileIndexer new variant
> (http://www.mediawiki.org/wiki/Extension_talk:FileIndexer#New_Variant)
> from Ramon Dohle (raZe) on my version 1.12 and it works well for
> English text. When I upload a PDF file containing French accented
> characters such as e-acute ("é"), those are wrongly indexed and show
> on the file upload page.
>
> I've looked inside the wiki database (table wikiprefix_searchindex,
> column si_text) and found that an e-acute is represented as the string
> "u8c3a9" for any standard page while it is represented by "u8efbfbd"
> for the uploaded PDF entry. Actually any accented character is
> represented by "u8efbfbd" ! Of course searching doesn't work with such
> character substitution.
>
> "u8c3a9" is actually the UTF-8 encoding of e-acute (bytes C3 A9). I'm
> not sure about "u8efbfbd" but it seems to be a kind of placeholder.
>
> Any advice appreciated.
> --
> [email protected]
> Author of ICS (Internet Component Suite, freeware)
> Author of MidWare (Multi-tier framework, freeware)
> http://www.overbyte.be
>
> _______________________________________________
> MediaWiki-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
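
For reference, the two index strings quoted in the thread can be inspected with a short Python sketch. It assumes, based only on the values reported above, that si_text stores a non-ASCII character as "u8" followed by the hex of its UTF-8 bytes; the helper name decode_index_token is hypothetical and not part of MediaWiki or FileIndexer. Under that assumption "u8efbfbd" decodes to U+FFFD, the Unicode replacement character, which is what a lenient UTF-8 decoder emits when it meets a byte it cannot decode. The last lines show one possible way such a value could arise (Latin-1 text being read as UTF-8); whether that is what happens inside the extension is not established here.

```python
# Hypothetical helper: assumes an index token is "u8" + hex of UTF-8 bytes,
# as suggested by the "u8c3a9" / "u8efbfbd" values quoted in the thread.
def decode_index_token(token: str) -> str:
    if not token.startswith("u8"):
        raise ValueError("expected a token starting with 'u8'")
    return bytes.fromhex(token[2:]).decode("utf-8")


e_acute = decode_index_token("u8c3a9")        # bytes C3 A9
placeholder = decode_index_token("u8efbfbd")  # bytes EF BF BD

print(f"U+{ord(e_acute):04X}")      # code point of the first token
print(f"U+{ord(placeholder):04X}")  # code point of the second token

# One possible cause (an assumption, not a diagnosis): e-acute encoded as
# Latin-1 is the single byte E9, which is not valid UTF-8 on its own.
latin1_bytes = "\u00e9".encode("latin-1")
lossy = latin1_bytes.decode("utf-8", errors="replace")

# The lossy result re-encodes to the same bytes seen in the PDF entry.
print(lossy.encode("utf-8").hex())
```

If the text handed to the indexer is not UTF-8 at the point where it is read, every accented character would collapse to the same replacement value, which would match the symptom described in the original post.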
