On Wed, 3 Jan 2024 at 20:32, James Addison <j...@jp-hosting.net> wrote:
>
> (with apologies for forgetting to cc Rene on my previous message)
>
> On Wed, 3 Jan 2024 at 19:45, James Addison <j...@jp-hosting.net> wrote:
> >
> > On Wed, 3 Jan 2024 at 16:59, Thorsten Behrens <t...@libreoffice.org> wrote:
> > >
> > > Source: clucene-core
> > > Followup-For: Bug #1059805
> > > X-Debbugs-Cc: r...@debian.org, t...@libreoffice.org
> > >
> > > James Addison wrote:
> > > > And so a question: could the fix be achieved by changing the default
> > > > value of the version field from millis-timestamp to zero -- meaning:
> > > > without needing to adjust the API?
> > > >
> > > Note that we _only_ need this during build time (so another option
> > > would be using the custom clucene during build, but still link the
> > > system clucene for runtime).
> > >
> > > I simply don't know enough about the motivations to have this
> > > pseudo-random value there in the first place, to opine on whether the
> > > default can/should be changed...
> >
> > Agreed.  I'm trying to trace the origins of that behaviour.
> [ ... snip ... ]
> > This timestamping was introduced[1] in Lucene 1.9RC1 back in Y2005 -
> > the changelog entry[2] doesn't tell us much more, although we can
> > imply that it was for a timing/refreshness related fix, because the
> > change adds an isCurrent method (described as useful to check whether
> > an IndexReader has an up-to-date object reference to an index/segment)
> > and test coverage.  There is a subsequent race-condition fixup[3].
> [ ... snip ... ]
>
> Squinting more at the test coverage, a hypothetical explanation I can
> think of is this:
>
> If two index writer processes were likely to recreate an index file
> near the same time, or in the presence of long-lived index reader
> processes, then with a zero-based counter, a reader could easily
> mistake two independently-created index segments (created with segment
> version zero) as being 'current'.
>
> I don't know for certain that this is the scenario that the timestamp
> is intended to mitigate against -- it's certainly also possible that
> Solr replication (noted elsewhere in the LUCENE-3607 JIRA thread) was
> just as important, or maybe moreso.
>
> However: libreoffice is, as I understand it, using this during
> one-time build processes, and does not require replication of those
> indexes.
>
> I think the safe middle-ground rather than resetting to zero during a
> reproducible build would be to use the value of SOURCE_DATE_EPOCH[1].
> That should mean no API changes, and we're still using a timestamp,
> keeping us close to the current behaviour -- but in a more
> reproducible way.
[ ... snip ... ]

The attached patch for clucene-core-2.3.3.4+dfsg-1.1 compiles and
implements SOURCE_DATE_EPOCH time retrieval -- but only in the context
of the segment info version fields.

Running a help build with SOURCE_DATE_EPOCH=0 should, with this patch,
be more-or-less equivalent to calling the proposed additional API
method to reset the segment version to zero.

The reason it doesn't override system time retrieval globally in the
codebase is that RAMDirectory objects are initialized[1] using the
system current time, and contain a while loop[2] that compares the
lastmodified on those objects to the output of another, subsequent
call to retrieve the system time -- so I thought it'd be sensible to
avoid the risk of an infinite loop there (while ts1 == ts2).

Please note: I haven't tested the patch yet, hence not adding a
'patch' tag to this bug yet - I'm building libreoffice locally after
installing the patched+compiled clucene package, and will inspect the
help indexes constructed by the build.  Feedback / review /
alternative approach suggestions are welcome.

[1] - 
https://sources.debian.org/src/clucene-core/2.3.3.4%2Bdfsg-1.1/src/core/CLucene/store/RAMDirectory.cpp/#L45

[2] - 
https://sources.debian.org/src/clucene-core/2.3.3.4%2Bdfsg-1.1/src/core/CLucene/store/RAMDirectory.cpp/#L523
Description: Read default index segment version from SOURCE_DATE_EPOCH when set
Author: James Addison <j...@jp-hosting.net>
Bug-Debian: https://bugs.debian.org/1059805

---
--- clucene-core-2.3.3.4+dfsg.orig/src/core/CLucene/index/SegmentInfos.cpp
+++ clucene-core-2.3.3.4+dfsg/src/core/CLucene/index/SegmentInfos.cpp
@@ -560,7 +560,7 @@ string SegmentInfo::segString(Directory*
 
       //initialize counter to 0
       counter = 0;
-      version = Misc::currentTimeMillis();
+      version = Misc::currentTimeMillis(true /* checkSourceDateEpoch */);
 	  if (reserveCount > 1)
 		  infos.reserve(reserveCount);
   }
@@ -762,7 +762,7 @@ string SegmentInfo::segString(Directory*
 
 		  if(format >= 0){    // in old format the version number may be at the end of the file
 			  if (input->getFilePointer() >= input->length())
-				  version = CL_NS(util)::Misc::currentTimeMillis(); // old file format without version number
+				  version = CL_NS(util)::Misc::currentTimeMillis(true /* checkSourceDateEpoch */); // old file format without version number
 			  else
 				  version = input->readLong(); // read version
 		  }
--- clucene-core-2.3.3.4+dfsg.orig/src/shared/CLucene/util/Misc.cpp
+++ clucene-core-2.3.3.4+dfsg/src/shared/CLucene/util/Misc.cpp
@@ -139,7 +139,13 @@ void Misc::_cpycharToWide(const char* s,
 #endif
 
 //static
-uint64_t Misc::currentTimeMillis() {
+uint64_t Misc::currentTimeMillis(const bool checkSourceDateEpoch) {
+    if (checkSourceDateEpoch) {
+        char* sourceDateEpoch = std::getenv("SOURCE_DATE_EPOCH");
+        if (sourceDateEpoch && *sourceDateEpoch) {
+            return ((uint64_t) strtoul(sourceDateEpoch, NULL, 10)) * 1000;
+        }
+    }
 #ifndef _CL_HAVE_FUNCTION_GETTIMEOFDAY
     struct _timeb tstruct;
     _ftime(&tstruct);
--- clucene-core-2.3.3.4+dfsg.orig/src/shared/CLucene/util/Misc.h
+++ clucene-core-2.3.3.4+dfsg/src/shared/CLucene/util/Misc.h
@@ -15,7 +15,7 @@ CL_NS_DEF(util)
   class CLUCENE_SHARED_EXPORT Misc{
     static void zerr(int ret, std::string& err);
   public:
-    static uint64_t currentTimeMillis();
+    static uint64_t currentTimeMillis(const bool checkSourceDateEpoch = false);
     static const TCHAR* replace_all( const TCHAR* val, const TCHAR* srch, const TCHAR* repl );
     static bool dir_Exists(const char* path);
     static int64_t file_Size(const char* path);

Reply via email to