Hi Tim,

For additional info, here is an example: https://openrepository.aut.ac.nz/handle/10292/194

The item has a .doc file with 42,000 characters. With "textextractor.max-chars = 1000000" (changed by adding one more 0), "filter-media -f" took a day to complete and "index-discovery -b" ran for about 10 minutes. When I search for "noodle", DSpace 5.8 finds the item but DSpace 7.3 does not (after the full scan and re-index). The full scan fails with an out-of-memory error when "textextractor.max-chars = -1" is set.

Regards,
Bryan

On Wednesday, November 23, 2022 at 8:47:14 AM UTC+13 Snickers wrote:

> Hi Tim,
>
> Thank you for the details you provided. We have made progress on this
> issue.
> I configured the lines "textextractor.max-chars = -1" and
> "textextractor.use-temp-file = true", then restarted Tomcat.
> On a machine with 8GB RAM and a max heap size of 6G, I ran the
> "filter-media -f" command. It ran for a while, then failed with the
> output below:
>
> ----------------------------------------------------------------------------------
> File: SundararajA.pdf.jpg
> FILTERED: bitstream 84c9128e-34a7-42e5-a83d-64be008bb082 (item: 10292/14803) and created 'SundararajA.pdf.jpg'
> File: ATEM Poster - Serena OP.pdf.txt
> FILTERED: bitstream 76432201-ee2b-481c-8c18-2889c935b2df (item: 10292/4602) and created 'ATEM Poster - Serena OP.pdf.txt'
> File: ATEM Poster - Serena OP.pdf.jpg
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 837021948 bytes for AllocateHeap
> # An error report file with more information is saved as:
>
> -----------------------------------------------------------------------------------
>
> I have attached the error report "hs_err_pid4031.log".
>
> The storage spaces are as below:
> --------------------------------------------------------
> Filesystem                      Type  Size  Used  Avail  Use%  Mounted on
> /dev/mapper/vg_root-lv_root     xfs   20G   6.9G  14G    35%   /
> /dev/sda1                       xfs   507M  221M  287M   44%   /boot
> /dev/mapper/vg_root-lv_var      xfs   8.0G  3.0G  5.1G   38%   /var
> /dev/sdc                        xfs   1.0T  454G  570G   45%   /DISK2
> /dev/mapper/vg_root-lv_tmp      xfs   16G   1.1G  15G    7%    /tmp
> /dev/mapper/vg_root-lv_var_log  xfs   4.0G  597M  3.5G   15%   /var/log
> ---------------------------------------------------------
>
> Any idea or suggestion would be much appreciated.
>
> Regards,
> Bryan
>
> On Thursday, September 1, 2022 at 9:16:33 AM UTC+12 Tim Donohue wrote:
>
>> Hi Bryan,
>>
>> Search result issues can be difficult to track down without very
>> specific examples or even links to a public website (feel free to use our
>> demo7.dspace.org site to try and reproduce issues).
>> Usually, it's best to look for common *patterns* in the results you are
>> seeing, as that may be helpful to us in tracking down what those behaviors
>> have in common (e.g. if all the files that do not match searches properly
>> are PDFs, that's a clue. Or, if they all are large files, that'd be a
>> different clue. Or, if you find a specific metadata field isn't searchable,
>> that's yet another clue.)
>>
>> Since you specified that one difference is in searching the *full
>> text* of a document, it's possible that changes/updates to the full
>> text indexing in DSpace 7.3 could be impacting your results.
>>
>> For instance, by default in DSpace 7.3, only the first 100,000 characters
>> of a document are searchable. However, you can change this default in a
>> configuration here:
>> https://github.com/DSpace/DSpace/blob/main/dspace/config/dspace.cfg#L492-L498
>>
>> (Notice in the comments that you'd have to re-extract text and re-index
>> if you change this setting.
>> Instructions are in those comments)
>>
>> That's a very quick guess though based on the limited info you've been
>> able to provide so far. I'd recommend looking more closely at your results
>> for patterns or common clues...that might be able to help us figure out
>> what the cause may be (and whether it's a bug, or maybe just a
>> configuration that needs to be tweaked).
>>
>> Tim
>> ------------------------------
>> *From:* [email protected] <[email protected]> on behalf
>> of Snickers <[email protected]>
>> *Sent:* Wednesday, August 31, 2022 3:59 PM
>> *To:* DSpace Technical Support <[email protected]>
>> *Subject:* [dspace-tech] Re: Issue with Dspace 7 search
>>
>> Hi Tim,
>>
>> Thank you for your response. I am sure that there have been many
>> improvements made to DSpace and Solr over the version updates and
>> appreciate the effort of the devs.
>>
>> I looked a bit deeper into the search results from both 5.8 and 7.3. It
>> seems that the search finds the keyword in the thesis text. I
>> found an item where the keyword is mentioned once in the text and the
>> search found it. However, I also found a few items where the keyword
>> appeared one or more times in the text that 7.3 did not find but 5.8 did.
>>
>> Where could I look to resolve the issue? The number
>> of items is similar and the items look to have been migrated successfully. I
>> have successfully run the full reindex commands found in Step 4 of the
>> migration doc:
>> # Reindex all your content in DSpace
>> ./dspace index-discovery -b
>>
>> # (Optionally) also reindex everything into OAI-PMH endpoint
>> ./dspace oai import
>>
>> Please help. Any suggestion would be appreciated.
>>
>> Regards,
>> Bryan
>>
>> On Wednesday, August 31, 2022 at 5:33:24 AM UTC+12 Tim Donohue wrote:
>>
>> Hi Bryan,
>>
>> It's really hard to say what could be going on without you digging more
>> into which items matched in 5.8 but didn't match in 7.3 (or vice versa).
>> It could be that 5.8 was actually returning incomplete results and the
>> results are *more accurate* in 7.3. Or, as you imply, it's also possible
>> the other way around...somehow 7.3 isn't returning results as accurate
>> as 5.8.
>>
>> But, it is worth pointing out that the Solr search settings under DSpace
>> are enhanced little by little in every release. So, there were many
>> changes/improvements in 6.x and continue to be more in 7.x. We've also
>> upgraded Solr several times in those releases, so it's possible that Solr
>> itself is returning slightly different results based on its new/updated
>> behavior.
>>
>> Overall, until you dig more deeply into those search result differences
>> between 5.x vs 7.x, I wouldn't assume that there's a bug in 7.x. There's
>> also the possibility you are just seeing improvements that resulted in more
>> accurate results. But, that said, if you are able to pinpoint some sort of
>> buggy behavior, then definitely let us know & we'll work to get it assigned
>> and fixed in a future 7.x release.
>>
>> Tim
>>
>> On Monday, August 29, 2022 at 11:32:51 PM UTC-5 [email protected] wrote:
>>
>> Hi,
>>
>> I am migrating DSpace from 5.8 to a new 7.3. I have followed the
>> documentation and completed all tasks:
>> https://wiki.lyrasis.org/display/DSDOC7x/Migrating+DSpace+to+a+new+server
>> and https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace
>>
>> When I search, for example, I get 26 items from DSpace 7.3 whereas I get
>> 79 from DSpace 5.8. When searching for an empty space, I get similar total
>> counts, e.g. 5172 and 5192.
>>
>> Did anyone experience this? To be clear, I have run the reindexing
>> commands found in the documentation above and the commands completed
>> successfully.
>>
>> No useful log was found since this is technically not an error.
>>
>> Any idea or suggestion would be appreciated.
>>
>> Regards,
>> Bryan
>>
>> --
>> All messages to this mailing list should adhere to the Code of Conduct:
>> https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "DSpace Technical Support" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/dspace-tech/2a800873-a3e5-4b5f-ba6c-b04be22139cen%40googlegroups.com
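
A rough way to test the character-limit explanation raised in this thread is to check how far into an item's extracted text the keyword first appears. The sketch below is an illustration only, not part of DSpace. It assumes GNU grep (for the -b and -o options) and a locally saved copy of the extracted text. The function names first_match_offset and within_limit are hypothetical, and the offset is counted in bytes rather than characters, so it is only approximate for non-ASCII text.

```shell
#!/bin/sh
# Hypothetical helpers (not part of DSpace): locate a keyword in extracted text.

# first_match_offset KEYWORD FILE
# Prints the 0-based byte offset of the first case-insensitive, literal match,
# or prints nothing if the keyword does not occur. Relies on GNU grep -b -o.
first_match_offset() {
  grep -o -b -i -F -- "$1" "$2" | sed -n '1s/:.*//p'
}

# within_limit KEYWORD FILE LIMIT
# Succeeds (exit status 0) only if the first match starts before LIMIT bytes.
within_limit() {
  offset=$(first_match_offset "$1" "$2")
  [ -n "$offset" ] && [ "$offset" -lt "$3" ]
}
```

For example, "within_limit noodle extracted.txt 100000" (with extracted.txt as a placeholder file name) would report whether the first occurrence falls inside the default 100,000-character window mentioned earlier in the thread. If the keyword sits well inside the limit, as would be expected for a 42,000-character file, the limit alone would not explain the missing result.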
