Hi Kenney,

I must admit that we currently don't have documentation for how to enable 
Chinese full text indexing in DSpace.

However, if you are storing primarily Chinese full text documents in your 
DSpace, I don't think it would be too difficult to change the current Solr 
indexing settings to support that.

Solr has some documentation on how best to index Chinese here: 
https://solr.apache.org/guide/8_0/language-analysis.html#traditional-chinese

What I think you'd want to do in DSpace is to add a new fieldType called 
"text_mandarin" (or similar) to the 'search' schema:
https://github.com/DSpace/DSpace/blob/main/dspace/solr/search/conf/schema.xml#L68-L104

This fieldType might look something like this:

<fieldType name="text_mandarin" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Then, if you want the "fulltext" field (which stores the fulltext of documents) 
to always do indexing/parsing of Chinese, you'd change its type to be 
"text_mandarin" (instead of just "text") here:

https://github.com/DSpace/DSpace/blob/main/dspace/solr/search/conf/schema.xml#L237
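
As a rough sketch, the changed field definition might look something like the 
following. I'm assuming the attributes here (indexed, stored, multiValued), so 
keep whatever the existing "fulltext" line in your schema.xml already has and 
change only the type:

```xml
<!-- Sketch only: every attribute except "type" is an assumption.
     Copy the attributes from the existing "fulltext" definition in your
     schema.xml and change only type="text" to type="text_mandarin". -->
<field name="fulltext" type="text_mandarin" indexed="true" stored="true" multiValued="true"/>
```

The "text_mandarin" fieldType has to be defined in the same schema.xml for 
this to load.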

Then you'd have to reindex everything in Solr (./dspace index-discovery -b).

I think this would work, but I'll admit I've never tried it. So, it's always 
possible I'm overlooking a step to get this working.

Keep in mind, this would only change the behavior of full text 
indexing/searching, and it would change that behavior globally (so all 
documents in DSpace would be assumed to contain Chinese text). Unfortunately, 
at this time, DSpace doesn't have any smart way to detect the language of 
documents and index each language differently.

If this sounds like what you need & you find it works for you, please let us 
know. That way we can more formally document similar instructions for others 
who may need them.

Tim


________________________________
From: [email protected] <[email protected]> on behalf of 
Kenney Guo <[email protected]>
Sent: Tuesday, August 30, 2022 8:14 PM
To: DSpace Technical Support <[email protected]>
Subject: [dspace-tech] documentation for Chinese full text indexing

Dear DSpace team,

With a default installation of DSpace 7.2, I am not able to search my 
Chinese documents well. After some research, I realized that I can configure the 
(word) Analyzer in Solr. However, I did not find any official documentation on 
how to do that. Could anyone point me to that documentation?

Thanks very much,

Kenney

--
All messages to this mailing list should adhere to the Code of Conduct: 
https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
---
You received this message because you are subscribed to the Google Groups 
"DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/dspace-tech/edf5966d-8476-4a71-82d1-8b22e7b31b28n%40googlegroups.com.

