Re: [Dspace-tech] Odd Characters in Search Results...

2015-05-29 Thread bkelm
Andrea, thank you so much.

We added this to the top of our cron job:

LANG=en_US.UTF-8

We removed the corrupt text bundle and re-ran the media filter:

/dspace/bin/dspace filter-media -i 10177/4732

and the files now look perfect.
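
For anyone who would rather not depend on the crontab header, the locale and
filter-media steps above can be sketched as a small wrapper. This is only a
sketch under assumptions: DSPACE_BIN and run_filter_media are names made up
for this example, the path and handle are the ones from the commands above,
and the locale must be one that is actually installed on the server. It does
not cover removing the corrupt text bundle.

```shell
#!/bin/sh
# Sketch only: DSPACE_BIN and run_filter_media are hypothetical names.
# Path to the DSpace launcher; defaults to the location used above.
DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"

# Run the media filter for a single handle with a UTF-8 locale.
# The LANG assignment applies only to the environment of this one command.
run_filter_media() {
    LANG=en_US.UTF-8 "$DSPACE_BIN" filter-media -i "$1"
}

# Example call (uncomment on a real installation):
# run_filter_media 10177/4732
```

Note that if LC_ALL is set in the environment, it takes precedence over LANG,
so it may need to be unset or set to the same UTF-8 locale.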

Bill K.





--
View this message in context: 
http://dspace.2283337.n4.nabble.com/Odd-Characters-in-Search-Results-tp4678061p4678125.html
Sent from the DSpace - Tech mailing list archive at Nabble.com.

--
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette


Re: [Dspace-tech] Odd Characters in Search Results...

2015-05-28 Thread euler
Hi Andrea,

I think I figured out how to apply this in a Windows environment. I just
added the line LANG=en_US.UTF-8 at the end of the dspace filter-media
command. I first searched our repository for items that returned odd
characters in their search results. Then I forced DSpace to reindex each of
those items, and the odd characters went away.

Hope this helps the original poster of this thread. ;-)

Thank you very much,

euler



--
View this message in context: 
http://dspace.2283337.n4.nabble.com/Odd-Characters-in-Search-Results-tp4678061p4678075.html



Re: [Dspace-tech] Odd Characters in Search Results...

2015-05-28 Thread Andrea Schweer
Hi,

On 28/05/15 17:21, euler wrote:
> Thanks for the link. I forgot to mention that I am using Windows 2003 as my
> OS, so I'm not using crontab; instead I have a batch file that is executed
> by Scheduled Tasks. Apologies for my ignorance, but I don't know how to
> apply this to a Windows environment.

I have no idea either, but perhaps (hopefully) someone else on this list 
can help!

cheers,
Andrea

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand




Re: [Dspace-tech] Odd Characters in Search Results...

2015-05-27 Thread Andrea Schweer
Hi,

On 28/05/15 15:24, euler wrote:
> Thanks for this. I had just assumed that the original characters in my PDFs
> were defective somehow (some actually are defective: not OCRed, but
> born-digital documents). I would be glad to know how to make sure that the
> DSpace media filter uses the correct locale and UTF-8 encoding. I may have
> missed something in the documentation on how to set this.

Well -- how are you running the media filter? If you're running it from 
a crontab on Linux, try the line I put into my other reply. Or see, e.g., 
http://www.logikdev.com/2010/02/02/locale-settings-for-your-cron-job/

cheers,
Andrea

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand




Re: [Dspace-tech] Odd Characters in Search Results...

2015-05-27 Thread euler
Hi Andrea,

Thanks for this. I had just assumed that the original characters in my PDFs
were defective somehow (some actually are defective: not OCRed, but
born-digital documents). I would be glad to know how to make sure that the
DSpace media filter uses the correct locale and UTF-8 encoding. I may have
missed something in the documentation on how to set this.

Thanks in advance,
euler



--
View this message in context: 
http://dspace.2283337.n4.nabble.com/Odd-Characters-in-Search-Results-tp4678061p4678068.html



Re: [Dspace-tech] Odd Characters in Search Results...

2015-05-27 Thread euler
Hi Bill,

I'm having this issue too. I resolved it by adding TEXT to
xmlui.bundle.upload in dspace.cfg, i.e.
xmlui.bundle.upload = ORIGINAL, TEXT, METADATA, THUMBNAIL, LICENSE, CC-LICENSE
so that I can upload a TEXT bundle alongside the ORIGINAL, which is a PDF. I
just made sure that the text file was saved in UTF-8 encoding. The question
marks that you're seeing come from the text extracted by the DSpace media
filter. After I uploaded a text version and deleted the extracted text, the
question marks in the search results went away.
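
Before uploading a hand-made text version, it can be worth checking that the
file really is valid UTF-8. A small sketch, assuming the iconv utility is
installed; the function name is_valid_utf8 and the file name in the example
call are made up for illustration.

```shell
#!/bin/sh
# Sketch only: is_valid_utf8 is a hypothetical helper name.
# iconv exits non-zero if the input contains bytes that are not valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Example call (the file name is a placeholder):
# is_valid_utf8 extracted.txt && echo "ok to upload"
```

A plain ASCII file also passes this check, since ASCII is a subset of UTF-8;
text saved as Latin-1 with accented characters will typically fail it.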

Hope this helps.

Regards,
euler



--
View this message in context: 
http://dspace.2283337.n4.nabble.com/Odd-Characters-in-Search-Results-tp4678061p4678066.html



Re: [Dspace-tech] Odd Characters in Search Results...

2015-05-27 Thread Andrea Schweer
Hi,

On 28/05/15 14:44, euler wrote:
> I'm having this issue too. I resolved it by adding TEXT to
> xmlui.bundle.upload in dspace.cfg, i.e.
> xmlui.bundle.upload = ORIGINAL, TEXT, METADATA, THUMBNAIL, LICENSE, CC-LICENSE
> so that I can upload a TEXT bundle alongside the ORIGINAL, which is a PDF. I
> just made sure that the text file was saved in UTF-8 encoding. The question
> marks that you're seeing come from the text extracted by the DSpace media
> filter. After I uploaded a text version and deleted the extracted text, the
> question marks in the search results went away.

If that solved the problem for you then my suspicion is that your media 
filter runs with the wrong locale. You need to make sure that the media 
filter is using UTF-8. I have
LANG=en_NZ.UTF-8
at the top of tomcat's crontab for that reason (you presumably want 
something other than en_NZ).
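
One quick way to see which locale a cron job really gets is to check the
locale variables from inside the job itself. A minimal sketch, assuming a
POSIX shell; the function name has_utf8_locale is made up for this example,
and the precedence it applies (LC_ALL, then LC_CTYPE, then LANG) is the
standard POSIX order.

```shell
#!/bin/sh
# Sketch only: has_utf8_locale is a hypothetical helper name.
# Succeeds if the character-encoding locale selected by the environment
# names a UTF-8 codeset.
has_utf8_locale() {
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    effective="${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
    case "$effective" in
        *.UTF-8|*.UTF-8@*|*.utf8|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_utf8_locale; then
    echo "UTF-8 locale in effect"
else
    echo "no UTF-8 locale set; extracted text may be garbled"
fi
```

Running this from the same crontab as the media filter shows whether the
LANG line at the top is actually being picked up.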

cheers,
Andrea

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand

