Re: [blfs-book] [BLFS Trac] #10910: xapian-core-1.46

BLFS Trac via blfs-book Sat, 18 Aug 2018 16:26:53 -0700

#10910: xapian-core-1.46
-------------------------+-----------------------
 Reporter:  bdubbs       |       Owner:  bdubbs
     Type:  enhancement  |      Status:  assigned
 Priority:  normal       |   Milestone:  8.3
Component:  BOOK         |     Version:  SVN
 Severity:  normal       |  Resolution:
 Keywords:               |
-------------------------+-----------------------


Comment (by bdubbs):

 Xapian-core 1.4.6 (2018-07-02):

 API:

 * API classes now support C++11 move semantics when using a compiler which
   we are confident supports them (currently compilers which define
   __cplusplus >= 201103 plus a special check for MSVC 2015 or later).
   C++11 move semantics provide a clean and efficient way for threaded code
 to
   hand-off Xapian objects to worker threads, but in this case it's very
   unhelpful for availability of these semantics to vary by compiler as it
   quietly leads to a build with non-threadsafe behaviour.  To address
 this,
   user code can #define XAPIAN_MOVE_SEMANTICS before #include <xapian.h>
 to
   force this on, and will then get a compilation failure if the compiler
 lacks
   suitable support.
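
 A minimal sketch of the opt-in pattern described above, using only the
 standard library.  MYLIB_MOVE_SEMANTICS and the Handle class are
 hypothetical stand-ins, not Xapian names; with Xapian itself the macro to
 define before including the header is XAPIAN_MOVE_SEMANTICS.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical opt-in macro: force the feature on and fail loudly at compile
// time if the compiler cannot provide it, rather than silently copying.
#define MYLIB_MOVE_SEMANTICS

#if defined MYLIB_MOVE_SEMANTICS && __cplusplus < 201103L
# error "MYLIB_MOVE_SEMANTICS requested but the compiler lacks C++11 support"
#endif

// Hypothetical reference-counted handle, similar in spirit to an API class.
class Handle {
    std::shared_ptr<std::string> internal;
  public:
    explicit Handle(const std::string& s)
        : internal(std::make_shared<std::string>(s)) {}
    Handle(const Handle&) = default;             // copy: shares the internals
    Handle(Handle&&) noexcept = default;         // move: steals the internals
    Handle& operator=(const Handle&) = default;
    Handle& operator=(Handle&&) noexcept = default;
    bool empty() const { return !internal; }
    long use_count() const { return internal.use_count(); }
};

// Hand an object off so that the receiver becomes the sole owner.
Handle hand_off(Handle&& h) { return Handle(std::move(h)); }
```

 After hand_off(std::move(a)) the source object a no longer references the
 shared internals, which is the property a worker-thread hand-off relies on.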

 * MSet::snippet():

   + We were only escaping output for HTML/XML in some cases, which would
     potentially allow HTML to be injected into output (this has been
 assigned
     CVE-2018-0499).

   + Include certain leading non-word characters in snippets.  Previously
 we
     started the snippet at the start of the first actual word, but there
 are
     various cases where including non-word characters in front of the
 actual
     word adds useful context or otherwise aids comprehension.  Reported by
     Robert Stepanek in https://github.com/xapian/xapian/pull/180
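
 To illustrate the escaping fix, here is a small sketch of consistent
 HTML/XML escaping.  The function names and the <b> highlight markers are
 assumptions for illustration, not the library's implementation.

```cpp
#include <string>

// Escape the characters that are significant in HTML/XML text and attributes.
std::string escape_html(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char ch : in) {
        switch (ch) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default: out += ch;
        }
    }
    return out;
}

// Hypothetical snippet builder: every piece of document text is escaped and
// only the highlight markers chosen by the caller are emitted verbatim.
std::string highlight(const std::string& before, const std::string& hit,
                      const std::string& after) {
    return escape_html(before) + "<b>" + escape_html(hit) + "</b>" +
           escape_html(after);
}
```

 The point is that escaping is applied on every path that emits document
 text, not only on some of them.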

 * Add MSetIterator::get_sort_key() method.  The sort key has always been
   available internally, but wasn't exposed via the public API before,
 which
   seems like an oversight as the collapse key has long been available.
   Reported by 张少华 on xapian-discuss.

 * Database::compact():

   + Allow Compactor::resolve_duplicate_metadata() implementations to
 delete
     entries.  Previously if an implementation returned an empty string
 this
     would result in a user meta-data entry with an empty value, which
 isn't
     normally achievable (empty meta-data values aren't stored), and so
 will
     cause odd behaviour.  We now handle an empty returned value by
 interpreting
     it in the natural way - it means that the merged result is to not set
 a
     value for that key in the output database.

   + Since 1.3.5 compacting a WritableDatabase with uncommitted changes
 throws
     Xapian::InvalidOperationError when compacting to a single-file glass
     database.  This release adds similar checks for chert and when
 compacting
     to a multiple-file glass database.

   + In the unlikely event that the total number of documents or the total
     length of all documents overflow when trying to compact a
     multi-database,
     we throw an exception.  This is now a DatabaseError exception instead
 of a
     const char* exception (a hang-over from before this code was turned
 into a
     public API in the library).
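
 The "empty value means no entry" rule for merged metadata can be sketched
 with standard containers.  Metadata, Resolver and merge_metadata() are
 hypothetical names, and calling the resolver only for keys that occur more
 than once is an assumption based on the method name.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

using Metadata = std::map<std::string, std::string>;
// Hypothetical resolver: given a key and all the values seen for it, return
// the merged value.  An empty result means "do not set this key".
using Resolver = std::function<std::string(const std::string&,
                                           const std::vector<std::string>&)>;

Metadata merge_metadata(const std::vector<Metadata>& inputs,
                        const Resolver& resolve) {
    // Collect every value seen for each key across all inputs.
    std::map<std::string, std::vector<std::string>> seen;
    for (const Metadata& m : inputs)
        for (const auto& kv : m)
            seen[kv.first].push_back(kv.second);

    Metadata out;
    for (const auto& kv : seen) {
        std::string merged = kv.second.size() == 1
                                 ? kv.second.front()
                                 : resolve(kv.first, kv.second);
        if (!merged.empty())
            out[kv.first] = merged;  // an empty result leaves the key unset
    }
    return out;
}
```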

 * Document::remove_term(): Handle removing term at current TermIterator
   position - previously the underlying iterator was invalidated, leading
 to
   undefined behaviour (typically a segmentation fault).  Reported by
 Gaurav
   Arora.
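
 The underlying hazard is the usual one of erasing the element an iterator
 currently points at.  A minimal standard-library sketch of the safe
 pattern follows; TermMap is a hypothetical stand-in for a document's term
 list, not the library's data structure.

```cpp
#include <map>
#include <string>

using TermMap = std::map<std::string, unsigned>;  // term -> wdf (hypothetical)

// Remove every term starting with `prefix` while iterating over the map.
unsigned remove_terms_with_prefix(TermMap& terms, const std::string& prefix) {
    unsigned removed = 0;
    for (auto it = terms.begin(); it != terms.end(); ) {
        if (it->first.compare(0, prefix.size(), prefix) == 0) {
            it = terms.erase(it);  // erase() returns the next valid iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```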

 * TermIterator::get_termfreq() now always returns an exact answer.
 Previously
   for multi-databases we approximated the result, which is probably either
 a
   hang-over from when this method was used during Enquire::get_eset(), or
 else
   due to thinking that this method would be used in that situation (it
   certainly is not now).  If the user creates a TermIterator object and
 asks it
   for term frequencies then we really should give them the correct answer
 - it
   isn't hugely costly and the documentation doesn't warn that it might be
   approximated.

 * QueryParser::parse_query():

   + Now adds a colon after the prefix when prefixing a boolean term which
     starts with a colon.  This means the mapping is reversible, and
 matches
     what omega actually does in this case when it tries to reverse the
 mapping.
     Thanks to Andy Chilton for pointing out this corner case.

   + The parser now makes use of newer features in the lemon parser
 generator to
     make parsing faster and use less memory.
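
 Why the extra colon makes the mapping reversible can be shown with a small
 sketch.  The rule used here (insert a colon when the value starts with an
 ASCII capital letter or a colon) is a simplified assumption, and the
 function names are hypothetical.

```cpp
#include <string>

// Assumed rule: a separating colon is needed when the value starts with an
// ASCII capital letter or with a colon.
bool needs_colon(const std::string& value) {
    if (value.empty()) return false;
    char ch = value[0];
    return ch == ':' || (ch >= 'A' && ch <= 'Z');
}

// Forward mapping: boolean filter value -> prefixed term.
std::string add_prefix(const std::string& prefix, const std::string& value) {
    return needs_colon(value) ? prefix + ':' + value : prefix + value;
}

// Reverse mapping: assumes `term` really does start with `prefix`.
std::string strip_prefix(const std::string& prefix, const std::string& term) {
    std::string rest = term.substr(prefix.size());
    if (!rest.empty() && rest[0] == ':') rest.erase(0, 1);
    return rest;
}
```

 Without the colon case, the value ":x" would map to "XFOO:x" and be
 reversed to "x"; with it, the term is "XFOO::x" and reverses to ":x".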

 * Stem:

   + Add Indonesian stemming algorithm.

   + Small optimisations to almost all stemming algorithms.

 * Stopper:

   + Add Indonesian stopword list.

   + The installed version of the Finnish stopword list now has one word
 per
     line.  Previously it had several space-separated words on some lines,
 which
     works with C++'s std::istream_iterator but may be inconvenient for use
 from
     some other languages.

   + The installed versions of stopword lists are now sorted in byte order
     rather than whatever collation order is specified by LC_COLLATE or
 similar
     at build time.  This makes the build more reproducible, and also may
 be
     more efficient for loading into some data structures.
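
 A short sketch of why both layouts load identically from C++ and what
 byte-order sorting looks like.  load_stopwords() is a hypothetical helper,
 not part of the library.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Read whitespace-separated words, so "one word per line" and "several words
// per line" both work, then sort them in byte order.
std::vector<std::string> load_stopwords(std::istream& in) {
    std::istream_iterator<std::string> first(in), last;
    std::vector<std::string> words(first, last);
    // std::string comparison is by unsigned byte value and ignores the
    // locale, so the result does not depend on LC_COLLATE.
    std::sort(words.begin(), words.end());
    return words;
}
```

 A consumer in another language that reads line by line only sees the same
 words if the file has exactly one word per line.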

 * WritableDatabase::replace_document(term, doc): Check for last_docid
 wrapping
   when used on a sharded database.

 * Database::locked(): Consistently throw FeatureUnavailableError on
 platforms
   where we can't test for a database lock without trying to take it.
   Previously GNU Hurd threw DatabaseLockError while platforms where we
 don't
   use fcntl() locking at all threw UnimplementedError.

 * Database and WritableDatabase constructors: Fix handling of entries for
   disabled backends in stub database files to throw
 FeatureUnavailableError
   instead of DatabaseError.

 * Database::get_value_lower_bound() now works correctly for sharded
 databases.
   Previously it returned the empty string if any shard had no values in
 the
   specified slot.
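
 The corrected behaviour amounts to ignoring shards that have no values in
 the slot when taking the minimum.  A sketch under that assumption, with a
 hypothetical SlotStats structure:

```cpp
#include <string>
#include <vector>

struct SlotStats {      // hypothetical per-shard statistics for one value slot
    unsigned freq;      // number of documents with a value in the slot
    std::string lower;  // smallest value in the slot (unused if freq == 0)
};

std::string value_lower_bound(const std::vector<SlotStats>& shards) {
    bool found = false;
    std::string result;
    for (const SlotStats& s : shards) {
        if (s.freq == 0) continue;  // no values here: cannot lower the bound
        if (!found || s.lower < result) {
            result = s.lower;
            found = true;
        }
    }
    return result;  // empty only when no shard has any value in the slot
}
```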

 * PostingIterator was failing to keep an internal reference to the parent
   Database object for sharded databases.

 * ValueIterator::skip_to() and check() had an off-by-one error in their
 docid
   calculations in some cases with sharded databases.

 testsuite:

 * apitest:

   + Enable testcases flagged metadata, synonym and/or writable to run on
     sharded databases.

   + Enable testcases flagged writable to run on sharded databases.
 Writing to
     a sharded WritableDatabase has been supported since 1.3.2, but the
 test
     harness wasn't running many of the tests that could be with a sharded
     WritableDatabase.  This uncovered three bugs which are fixed in this
     release.

   + Support "generated" testcases for the inmemory backend, which
 uncovered a
     bug which is fixed in this release.

   + Skip testcase testlock1 on platforms that don't allow us to implement
     Database::locked() (which notably include GNU Hurd and Microsoft
 Windows).

   + Disable testlock2 on sharded databases as it fails for platforms which
     don't actually support testing the lock.

   + Extend tests of behaviour after database close.  Patch from Guruprasad
     Hegde.  Fixes https://trac.xapian.org/ticket/337

   + Enable testcase closedb5 for remote backends.  This testcase failed
 for
     remote backends when it was added and the cause wasn't clear, but it
 turns
     out it was actually a bug in the disk based backends, which was fixed
 way
     back in 2010.  Reported by Guruprasad Hegde.

   + Check for select() failing in retrylock1 testcase.  Retry on EINTR or
     EAGAIN, and report other errors rather than trying the read() anyway.
     Previously the read() would likely fail for the same reason the
 select()
     did, but at best this is liable to make what's going on less clear if
 the
     testcase fails.

 * Report bool values as true/false not 1/0.

 * Assorted minor testcase improvements.

 * The test harness now supports testcases which are expected to fail
 (XFAIL).
   Based on patch from Richard Boulton in
 https://trac.xapian.org/ticket/156.

 * Fix demangling of std::exception subclass names which wasn't happening
 due
   to a typo in the preprocessor check for the required header.  This was
 broken
   by changes in 1.4.2.

 * Make TEST_EQUAL() arguments side-effect free.  The TEST_EQUAL() macro
   evaluates its arguments a second time if the test fails in order to
 report
   their values.  This isn't ideal and really ought to be addressed, but
 for now
   fix uses where the argument has side-effect (e.g. *i++) such that the
   reported value should match the tested value.
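
 The double evaluation is easy to reproduce with a simplified macro.
 CHECK_EQUAL below is a hypothetical stand-in and not the test harness's
 actual TEST_EQUAL() definition.

```cpp
#include <iostream>

// Hypothetical simplified macro: on failure the arguments are evaluated a
// second time in order to report their values.
#define CHECK_EQUAL(a, b) \
    do { \
        if (!((a) == (b))) \
            std::cerr << "expected equal: " << (a) << " vs " << (b) << '\n'; \
    } while (0)

// Returns how far the pointer advanced: 2, because *i++ is evaluated twice
// when the check fails (and the reported value is 20, not the tested 10).
int demo_double_evaluation() {
    const int data[] = {10, 20, 30};
    const int* i = data;
    CHECK_EQUAL(*i++, 99);
    return static_cast<int>(i - data);
}

// Side-effect-free variant: the pointer advances exactly once.
int demo_side_effect_free() {
    const int data[] = {10, 20, 30};
    const int* i = data;
    int value = *i++;
    CHECK_EQUAL(value, 99);
    return static_cast<int>(i - data);
}
```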

 * runtest: Show usage if first option starts '-'.  Previously we ended up
   passing such options to libtool, so putting -v on runtest instead of
 apitest
   would run the tests but -v would effectively do nothing (it would make
   libtool verbose, but that doesn't make any difference in this case):
   ./runtest -v ./apitest

 * Suppress output from xcopy on MS Windows.

 * The test harness machinery for detecting file descriptor leaks should
 now
   work on any platform which has /dev/fd.

 * Implement recursive delete of a database directory in the test harness
   using nftw() if available (and not buggy like mingw64's seems to be),
 rather
   than running "rm -rf" as an external command.  This avoids the overhead
 of
   starting a new process each time we clean up a test database, which
 happens a
   lot during a test run.
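
 An in-process recursive delete with nftw() can be sketched as below.  This
 is an illustration for POSIX systems rather than the harness's code; the
 function names and the /tmp location are assumptions.

```cpp
#include <ftw.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <fstream>
#include <string>

// Callback for nftw(): remove() handles both files and empty directories.
int remove_entry(const char* path, const struct stat*, int, struct FTW*) {
    return std::remove(path);
}

// FTW_DEPTH visits a directory's contents before the directory itself;
// FTW_PHYS avoids following symbolic links out of the tree.
bool remove_recursive(const char* dir) {
    return nftw(dir, remove_entry, 16, FTW_DEPTH | FTW_PHYS) == 0;
}

// Small demonstration: build a temporary tree, delete it, check it is gone.
bool demo_remove_recursive() {
    char tmpl[] = "/tmp/rmtreeXXXXXX";
    if (!mkdtemp(tmpl)) return false;
    std::string dir(tmpl);
    if (mkdir((dir + "/sub").c_str(), 0700) != 0) return false;
    {
        std::ofstream f((dir + "/sub/file").c_str());
        f << "data";
    }
    return remove_recursive(dir.c_str()) && access(dir.c_str(), F_OK) != 0;
}
```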

 * Speed up generated test databases a little by adding a stat() check to
 avoid
   throwing and catching an exception when the database doesn't yet exist.

 * Skip timed tests when configured with --enable-log.  The logging can
 easily
   turn O(1) operations into O(n), and that's hard to avoid.  Fixes
   https://trac.xapian.org/ticket/757, reported by Guruprasad Hegde.

 matcher:

 * OP_VALUE_*: When a value slot's lower and upper bound are equal, we know
   exactly how many documents the subquery can match (either 0 or those
   bounds).  This also avoids a division by zero which previously happened
   when trying to calculate the estimate.

 * Speed up sorting by keys.  Use string::compare() to avoid having to call
   operator< if operator> returns false.
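
 A sketch of the single three-way comparison idea.  The Item structure and
 the tie-break on document id are assumptions for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Item {              // hypothetical match result
    std::string sort_key;
    unsigned did;
};

// Order by sort key, then by document id.  One compare() call yields a
// three-way result, so the key bytes are scanned once rather than twice.
bool by_key_then_docid(const Item& a, const Item& b) {
    int cmp = a.sort_key.compare(b.sort_key);
    if (cmp != 0) return cmp < 0;
    return a.did < b.did;
}
```

 It can be passed straight to std::sort(items.begin(), items.end(),
 by_key_then_docid).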

 * Fix clamping of maxitems argument to get_mset() - it was being clamped
   to db.get_doccount(), now it's clamped to db.get_doccount() - first.  In
   practice this doesn't actually seem to cause any issues.
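
 The corrected clamp can be written out as a one-line calculation.
 clamp_maxitems() is a hypothetical helper name, and returning 0 when first
 is past the end is an assumption made to avoid unsigned underflow.

```cpp
#include <algorithm>

// At most (doccount - first) documents can be returned when the first
// `first` matches are skipped.
unsigned clamp_maxitems(unsigned first, unsigned maxitems, unsigned doccount) {
    if (first >= doccount) return 0;
    return std::min(maxitems, doccount - first);
}
```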

 * If a match time limit is in effect, when it expires we now clamp
   check_at_least to first + maxitems instead of to maxitems.  In practice
 this
   also doesn't seem to actually cause any issues (at least we've failed to
   construct a testcase where it actually makes an observable difference).

 * Fix percentages when only some shards have positions.  If the final shard
   didn't have positions this would lead to under-counting the total number
   of leaf subqueries which would lead to incorrect positional calculations
   (and a division by zero if the top level of the query was positional).
   This bug was introduced in 1.4.3.

 * OP_NEAR: Fix "phantom positions", where OP_NEAR would think a term
 without
   positional information occurred at position 1 if it had the lowest term
   frequency amongst the OP_NEAR's subqueries.

 * Fix termfreq used in weight calculations for a term occurring more than
 once
   in the query.  Previously the termfreq for such terms was multiplied by
 the
   number of different query positions they appeared at.

 * OP_SYNONYM: We use the doclength upper bound for the wdf upper bound of
 a
   synonym - now we avoid fetching it twice when the doclength upper bound
 is
   explicitly needed.

 * Short-cut init() when factor is 0 in most Weight subclasses.  This
 indicates
   the object is for the term-independent weight contribution, which is
 always 0
   for most schemes, so there's no point fetching any stats or doing any
   calculations.  This fixes a divide by zero for TfIdfWeight, detected by
   UBSan.

 * OP_OR: Fix bug which caused orcheck1 to fail once hooked up to run with
 the
   inmemory backend.

 glass backend:

 * Fix glass freelist bug when changes to a new database which didn't
 modify the
   termlist table were committed.  In this corner case, a block which had
 been
   allocated to be the root block in the termlist table was leaked.  This
 was
   largely harmless, except that it was detected by Database::check() and
 caused
   it to report an error.  Reported by Antoine Beaupré and David Bremner.

 * Fix glass freelist bug with cancel_transaction().  The freelist wasn't
   reset to how it was before the transaction, resulting in leaked blocks.
   This was largely harmless, except that it was detected by
 Database::check()
   and caused it to report an error.

 * Improve the per-term wdf upper bound.  Previously we used min(cf(term),
   wdf_upper_bound(db)) which is tight for any terms which attain that
   upper bound, and also for terms with termfreq == 1 (the latter are
 common
   in the database (e.g. 66% for a database of wikipedia), but probably
   much less common in searches).  When termfreq > 1 we now use
   max(first_wdf(term), cf(term) - first_wdf(term)), which means terms with
   termfreq == 2 will also attain their bound (another 11% for the same
   database) while terms with higher termfreq but below the global bound
 will
   get a tighter bound.
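
 The new bound follows from the fact that every document other than the
 first can have a wdf of at most cf(term) - first_wdf(term).  A sketch of
 the calculation; the function name is hypothetical and the final clamp to
 the database-wide bound is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on the wdf of a term in any single document.
unsigned term_wdf_upper_bound(unsigned termfreq, unsigned collfreq,
                              unsigned first_wdf, unsigned db_wdf_bound) {
    assert(first_wdf <= collfreq);
    unsigned bound;
    if (termfreq <= 1) {
        bound = collfreq;  // a single document holds every occurrence
    } else {
        // The first document has first_wdf occurrences; all the others
        // together share the remaining collfreq - first_wdf.
        bound = std::max(first_wdf, collfreq - first_wdf);
    }
    return std::min(bound, db_wdf_bound);  // assumed: global bound still applies
}
```

 For termfreq == 2, cf == 10 and first_wdf == 3 this gives 7, which one of
 the two documents actually attains; the previous formula gave 10.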

 * Fix Database::locked() on single-file glass db to just return false
 (such
   databases can't be opened as a WritableDatabase so there can't be a
 write
   lock).  Previously this failed with: "DatabaseLockError: Unable to get
 write
   lock on /flintlock: Testing lock"

 * Fix compaction when both the input and output are specified as a file
   descriptor.  Previously this threw an exception due to an overeager
 check
   that destination != source.

 * Use O_TRUNC when compacting to single file.  If the output already
 exists but
   is larger than our output we don't want to just overwrite the start of
 it.
   This case also used to result in confusing compaction percentages.
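
 The effect of O_TRUNC is easy to demonstrate on a POSIX system.  The
 helper names below are hypothetical, and a single write() is assumed to be
 enough for the small data used here.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

// Write `data` to `path`, discarding any previous contents so that a longer
// old file cannot leave stale bytes after the new data.
bool write_file_truncating(const char* path, const std::string& data) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) return false;
    ssize_t n = write(fd, data.data(), data.size());
    bool ok = (n == static_cast<ssize_t>(data.size()));
    return close(fd) == 0 && ok;
}

// Size of the file in bytes, or -1 if it cannot be examined.
long file_size(const char* path) {
    struct stat st;
    return stat(path, &st) == 0 ? static_cast<long>(st.st_size) : -1;
}
```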

 * Enable glass's "open_nearby_postlist" optimisation (which especially
 helps
   large wildcard queries) for writable databases without any uncommitted
   changes as well.

 * Make get_unique_terms() more efficient for glass.  We approximate
   get_unique_terms() by the length of the termlist (which counts boolean
 terms
   too) but clamp this to be no larger than the document length.  Since we
 need
   to open the termlist to get its length, it makes more sense to get the
   document length from that termlist for no extra cost rather than looking
 it
   up in the postlist table.

 * Database::check() now checks document lengths against the stored
 document
   length lower and upper bounds.  Patch from Uppinder Chugh.  Fixes
   https://trac.xapian.org/ticket/617.

 * Fix bogus handling of most-recently-read value slot statistics.  It
 seems
   that we get lucky and this can't actually cause a problem in practice
 due
   to another layer of caching above, but if nothing else it's a bug
 waiting to
   happen.

 * If we fail to create the directory for a new database because the path
   already exists, the exception now reports EEXIST as the errno value
 rather
   than whatever errno value happened to be set from an earlier library
 call.
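
 The general technique is to save errno immediately after the failing call,
 before anything else can overwrite it.  A sketch with hypothetical names
 (CreateError, create_db_dir); the library's own exception classes differ.

```cpp
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

// Hypothetical exception type which carries the errno value.
struct CreateError : std::runtime_error {
    int errno_value;
    CreateError(const std::string& msg, int e)
        : std::runtime_error(msg + ": " + std::strerror(e)), errno_value(e) {}
};

void create_db_dir(const std::string& path) {
    if (mkdir(path.c_str(), 0755) == 0) return;
    int mkdir_errno = errno;  // save at once; later calls may change errno
    throw CreateError("Cannot create directory " + path, mkdir_errno);
}
```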

 remote backend:

 * xapian-tcpsrv --one-shot no longer forks.  We need fork to handle
 multiple
   concurrent connections, but when handling a single connection forking
 just
   adds overhead and potentially complicates process management for our
 caller.
   This aligns with the behaviour under __WIN32__ where we use threads
 instead
   of forking, and service the connection from the main thread with
   --one-shot.

 * Fix repeat call to ValueIterator::check() on the same docid to not
 always
   set valid to true for remote backend.

 inmemory backend:

 * Fix repeat call to ValueIterator::check() on the same docid to not
 always
   set valid to true for inmemory backend.

 build system:

 * configure: Fix potentially confusing messages suggesting snprintf was
 added
   in C90 - it was actually standardised in C99.

 * Eliminate configure probes related to off_t by using C++11 features.

 * The installed xapian-config script is now cleaned up by removing code to
   handle use before installation.  This extra code contained build paths
   which meant the build wasn't bit-for-bit reproducible unless the same
   build directory name was used.  This change also eliminates use of
   automake's $(transform) (which seems to be intended as an internal
   mechanism)
   and fixes "make uninstall" to remove xapian-config when a program-prefix
 or
   -suffix is in use (e.g. there's a default -1.5 suffix for git master
   currently).

 * Directory separator knowledge is now factored out into configure, based
 on
   $host_os and __WIN32__ (it seems hard to probe for this in a way which
 works
   when cross-compiling).

 * Fix build with --disable-backend-remote.

 * In an out-of-tree build configured with --enable-maintainer-mode
   and --disable-dependency-tracking we would fail to create the
   "tests/soaktest" and "unicode" directories in the build directory.
   Patch from Gaurav Arora.

 * Improve handling of multitarget rule stamp files.  Clean them on "make
   maintainer-clean" and ship them so that --enable-maintainer-mode when
   building from a tarball doesn't needlessly rerun the multitarget rules.

 * Split out allsnowballheaders.h again to avoid include path issues with
   unittest in out-of-tree maintainer-mode builds.

 * xapian-core.pc: Both the Name and Description were too long compared to
   pkg-config norms, and the Description was trying to be multi-line which
 it
   seems pkg-config doesn't support.  Fixes
   https://github.com/xapian/xapian/pull/203, reported by orbea.

 documentation:

 * Stop describing Xapian as "Probabilistic" - we've also had
   non-probabilistic
   weighting schemes since 1.3.2.

 * Improve API docs for MSet::snippet().

 * Correct some class names in doxygen file documentation comments.

 * Mark up shell command as code-block:: sh.

 tools:

 * xapian-delve:

   + Document values can contain binary data, so escape them by default for
     output.  Other options now supported are to decode as a packed integer
     (like omindex uses for last modified), decode using
     Xapian::sortable_unserialise(), and to show the raw form (which was
 the
     previous behaviour).

   + Report current database revision.

 * xapian-inspect:

   + Report entry count when opening table.

   + Support inspecting single file DBs via a new --table option (which can
 also
     be used with a non-single-file DB instead of specifying the path to
 the
     table).

   + Add "first" and "last" commands which jump to the first/last entry in
 the
     current table respectively.

   + "until" now counts and reports the number of entries advanced by.

   + Document "until" with no arguments - this advances to the end of the
 table,
     but wasn't mentioned in the help.

   + Commands "goto" and "until" which take a key as an argument now expect
 the
     key in the same escaped form that's used for display.  This makes it
 much
     simpler to interact with tables with binary keys.

   + Fix to expect .glass not .DB extension of glass tables.
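
 Using one escaped form for both display and input means a displayed key
 can be pasted back as an argument.  The exact escape syntax the tools use
 is not given above, so the \xNN scheme in this sketch is an assumption, as
 are the function names.

```cpp
#include <cstddef>
#include <string>

// Escape backslashes and non-printable bytes so binary data is safe to show.
std::string escape_binary(const std::string& in) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (unsigned char ch : in) {
        if (ch == '\\') {
            out += "\\\\";
        } else if (ch >= 0x20 && ch < 0x7f) {
            out += static_cast<char>(ch);
        } else {
            out += "\\x";
            out += hex[ch >> 4];
            out += hex[ch & 0x0f];
        }
    }
    return out;
}

// Reverse of escape_binary().  Input is assumed to be well formed; invalid
// hex digits after \x make std::stoi throw.
std::string unescape_binary(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\' || i + 1 >= in.size()) {
            out += in[i];
            continue;
        }
        char next = in[++i];
        if (next == 'x' && i + 2 < in.size()) {
            out += static_cast<char>(std::stoi(in.substr(i + 1, 2), nullptr, 16));
            i += 2;
        } else {
            out += next;
        }
    }
    return out;
}
```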

 portability:

 * Sort out building using MSVC with the standard build system, and fix
 assorted
   problems.  MSVC 2015 or later is required for decent C++11 support.
   Both 32- and 64-bit builds are now supported.

 * Remove code specific to old MSVC nmake build system.  The latter has
 been
   removed already.

 * Don't use WIN32 API to parse/unparse UUIDs.  So much glue code is needed
 that
   it's simpler to just do the parsing and unparsing ourselves, and we
 already
   have an implementation which is used when generating UUIDs using /proc
 on
   Linux.  We still use UuidCreate() to generate a new UUID.

 * Improve compiler visibility attribute detection to check that using the
   attributes doesn't result in a warning - previously we'd enable them
 even on
   platforms which don't support them, which would result in a compiler
 warning
   for every file compiled.  We now probe for -fvisibility=hidden and
   -fvisibility-inlines-hidden together as it seems all compilers implement
 both
   or neither, and it's faster to do one probe instead of two.

 * Don't pass the same FDSET twice in same select() - this appears not to
 be
   allowed by current POSIX, and causes warnings with GCC8.

 * Fix compacttofd testcases to specify O_BINARY so they pass on platforms
   where O_BINARY matters.

 * configure: Probe for declaration of _putenv_s.  It seems that the symbol
 is
   always present in the MSVCRT DLL, but older mingw may not provide a
   declaration for it.

 * Fix "may be used uninitialised" warning with GCC 4.9.2 and -Os.

 * Suppress mingw32 deprecation warning for useconds_t.  We've already
 switched
   away from useconds_t on git master, but it's not easy to do for 1.4.x
 without
   ABI breakage.

 * Fix signed vs unsigned warnings with assertions on.

 * Use $(SED) instead of hard-coding "sed".  The rules concerned are all
 ones
   that only maintainers currently need to run, but we're likely to enable
   maintainer-mode by default at some point and then portability here will
   matter more.

 * Add missing explicit <algorithm> for std::max()/std::min().

 * Check for EAGAIN as well as EINTR from select().  The Linux select(2)
 man
   page says: "Portable programs may wish to check for EAGAIN and loop,
 just as
   with EINTR" and that seems to be necessary for Cygwin at least.
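
 A sketch of a select() wrapper which retries on EINTR and EAGAIN and uses
 a separate fd_set for each argument.  wait_readable() is a hypothetical
 helper, and restarting the full timeout on each retry is a simplification.

```cpp
#include <cerrno>
#include <sys/select.h>
#include <unistd.h>

// Returns 1 if fd is ready (readable or in an exceptional state), 0 on
// timeout, and -1 on any other error with errno set by select().
int wait_readable(int fd, int timeout_secs) {
    for (;;) {
        fd_set readfds, exceptfds;  // distinct sets, one per argument
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        FD_ZERO(&exceptfds);
        FD_SET(fd, &exceptfds);
        struct timeval tv;
        tv.tv_sec = timeout_secs;
        tv.tv_usec = 0;
        int r = select(fd + 1, &readfds, nullptr, &exceptfds, &tv);
        if (r >= 0) return r > 0 ? 1 : 0;
        if (errno != EINTR && errno != EAGAIN) return -1;
        // Interrupted or temporarily short of resources: try again.
    }
}
```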

 * Probe for exp10() declaration as Cygwin seems to have the symbol but
 lacks a
   declaration in the headers.  Just ignoring it is simplest and we'll use
 GCC's
   __builtin_exp10() instead.

 * Fix warnings when building Snowball compiler with recent GCC.

 * Fix Perl script used during maintainer builds to work with Perl < 5.10.
 Such
   old perl versions shouldn't really be relevant for maintainer builds at
 this
   point, but appveyor's mingw install has such a Perl version.

 * Remove unused macro STATIC_ASSERT_TYPE_DOMINATES (unused, except by
   internaltest unit test for it, since the flint backend was removed in
 2011)
   and replace uses of STATIC_ASSERT_UNSIGNED_TYPE with C++11 features
   static_assert and std::is_unsigned instead.
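
 The C++11 replacement for a hand-rolled unsigned-type assertion looks
 roughly like this.  bits_needed() is a hypothetical function chosen only
 to give the check something to guard.

```cpp
#include <type_traits>

// Number of bits needed to represent `value`; only meaningful for unsigned
// types, which the static_assert enforces at compile time.
template<typename T>
unsigned bits_needed(T value) {
    static_assert(std::is_unsigned<T>::value,
                  "bits_needed() requires an unsigned type");
    unsigned n = 0;
    while (value) {
        ++n;
        value >>= 1;
    }
    return n;
}
```

 Calling bits_needed(-1) fails to compile with the message given, instead
 of silently looping on a negative value.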

 * Don't retry on (errno == EINTR) when read() or pread() indicates
   end-of-file.
   This could potentially have put us into an infinite loop if we
 encountered
   this situation and errno happened to be EINTR from a previous library
 call.
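
 The rule is that errno is only meaningful after read() has returned -1.  A
 sketch of a read loop which follows it; read_fully() is a hypothetical
 helper name.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Read up to n bytes.  Returns the number of bytes read (fewer than n only
// at end-of-file), or -1 on error.
ssize_t read_fully(int fd, char* buf, size_t n) {
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, buf + got, n - got);
        if (r > 0) {
            got += static_cast<size_t>(r);
            continue;
        }
        if (r == 0) break;             // end-of-file: do not look at errno
        if (errno == EINTR) continue;  // read() failed and was interrupted
        return -1;
    }
    return static_cast<ssize_t>(got);
}
```

 With a stale errno of EINTR left over from an earlier call, a loop that
 tested errno on a zero return would spin forever at end-of-file.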

 * Make read-only data arrays consistently static and const.

 * Avoid casting invalid value to enum reply_type if an invalid reply code
 is
   received from a remote server.  This is technically undefined behaviour,
   though in practice probably not a problem.
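
 Validating the raw code before the cast avoids creating an out-of-range
 enum value.  The enum name and its members below are hypothetical and do
 not reflect the remote protocol's actual reply codes.

```cpp
#include <stdexcept>

// Hypothetical reply codes; REPLY_KIND_MAX marks one past the last valid one.
enum reply_kind { REPLY_KIND_UPDATE, REPLY_KIND_EXCEPTION, REPLY_KIND_DONE,
                  REPLY_KIND_MAX };

// Check the byte received from the peer first, and only then convert it.
reply_kind decode_reply(unsigned char code) {
    if (code >= REPLY_KIND_MAX)
        throw std::runtime_error("Invalid reply type");
    return static_cast<reply_kind>(code);
}
```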

 * Eliminate an array of function pointers and some char* array members in
   library, reducing the number of relocations needed at shared library
 load
   time, which reduces the total time to load the library.

 packaging:

 * Use https for tarball URLs in .spec files.  This provides protection
 against
   MITM attacks on people building packages using these spec files, and is
 also
   slightly more efficient as the http: URLs redirect to the https:
 versions
   anyway.

 debug code:

 * Fix build when configured with --enable-log due to bugs in debug logging
   annotations.  Patch from Uppinder Chugh.

 * Fix assertion for value range on empty slot.

 * Use AssertEq() rather than Assert with ==, the former reports the two
   values if the assertion fails.

 Xapian-core 1.4.7 (2018-07-19):

 API:

 * Database::check(): Fix bogus error reports for documents with length
 zero
   due to a new check added in 1.4.6 that the doclength was between the
 stored
   upper and lower bounds, which failed to allow for the lower bound
 ignoring
   documents with length zero (since documents indexed only by boolean
 terms
   aren't involved in weighted searches).  Reported by David Bremner.

 * Query: Use of Query::MatchAll in multithreaded code causes problems
 because
   the reference counting gets messed up by concurrent updates.  Document
 that
   Query(string()) should be used instead of MatchAll in multithreaded
 code, and
   avoid using it in library code.  Reported by Germán M. Bravo.

 * Stem:

   + Stemming algorithms added for Irish, Lithuanian, Nepali and Tamil.

   + Merge Snowball compiler changes which improve code generation.

   + Merge optimisations to the Arabic and Turkish stemmers.

 testsuite:

   + Fix duplicate test in apitest closedb10 testcase.  Patch from
 Guruprasad
     Hegde.

 glass backend:

 * A long-lived cursor on a table in a WritableDatabase could get into
   an invalid state, which typically resulted in a DatabaseCorruptError
   being thrown with the message:

       Db block overwritten - are there multiple writers?

   But in fact the on-disk database is not corrupted - it's just that
   the cursor in memory has got into an inconsistent state.  It looks
   like we'll always detect the inconsistency before it can cause on-disk
   corruption but it's hard to be completely certain.

   The bug is in code to rebuild the cursor when the underlying table
   changes in ways which require that, which is a fairly rare occurrence
   to start with, and only triggers when a block in the cursor has been
   released, reallocated, and we tried to load it in the cursor at the
   same level - the cursor wrongly assumes it has the current version
   of the block.

   Reported with a reproducer by Sylvain Taverne.  Confirmed by David
   Bremner as also fixing a problem in notmuch for which he hadn't managed
   to find a reduced reproducer.

 documentation:

 * INSTALL: Document need to have MSVC command line tools on PATH.

 portability:

 * Cygwin: Work around oddity where unlink() sometimes seems to indicate
 failure
   with errno set to ECHILD.

--
Ticket URL: <http://wiki.linuxfromscratch.org/blfs/ticket/10910#comment:2>
BLFS Trac <http://wiki.linuxfromscratch.org/blfs>
Beyond Linux From Scratch
-- 
http://lists.linuxfromscratch.org/listinfo/blfs-book
FAQ: http://www.linuxfromscratch.org/blfs/faq.html
Unsubscribe: See the above information page
