Sorry if I haven't kept up! Below: git revisions with issues, then a rather
chaotic changelog. This is aimed at both infinity0 and platy.
In testing, XML searches currently don't show progress - or perhaps they don't
start at all. Searching for "test" (no quotes) stays at "Unstarted" /
"Operation length unknown", apparently indefinitely.
First, IMHO we should pin a specific version of SnakeYAML to download, with a
checksum. Would we have to host this ourselves?
Does the current code require the form password to start a search? I guess not,
for the purposes of inlining searches from freesites - but we could add it in
the filter, with a slightly more sophisticated form-filtering callback.
Now on to per-commit issues. Some of these may be out of date.
Search.java (originally cde966c7353a87bac8f84eca192c0fc99cb91b57):
Accesses to subsearches and allsearches need to be synchronized???
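A minimal sketch of the kind of guard I mean (field types assumed, not the
actual Search.java):

	// Sketch only: protect the shared collections with their own monitors.
	private final List<Search> subsearches = new ArrayList<Search>();
	private static final Map<String, Search> allsearches = new HashMap<String, Search>();

	public void addSubsearch(Search s) {
		synchronized (subsearches) {
			subsearches.add(s);
		}
	}

	public static Search getSearch(String id) {
		synchronized (allsearches) {
			return allsearches.get(id);
		}
	}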
7f356226e4a0a175fcf5ccb16f36059327f352b6
MapCollectionTranslator assumes the backing is what later becomes a
SortedMapSet - a map of elements each to itself. Should it maybe be private?
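i.e. the assumed backing looks something like this (illustrative only, not the
real class):

	// A "set" represented as a map of each element to itself:
	SortedMap<E, E> backing = new TreeMap<E, E>();
	backing.put(elem, elem);    // add
	backing.containsKey(elem);  // contains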
570872c82603d75e1922d061c51fd77f0d8d7183
CAPITALS should really only be used for constants (this commit is gone, but
other places do this).
bd50ba2b048e153d34893dc0384a1b63d41ee6a8
Is this wise? It will be different for every SSK under the same routing key, as
the routing key is constructed from the pubkey hash and E(H(docname))... OTOH
it will work for KSKs, unlike the old code.
740498979f511e4aafd710a448dae20586336304
containsValue() is broken - the value will never equal the SkeletonValue
wrapper.
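A sketch of the failure mode, assuming values are stored wrapped (SkeletonValue
and its data field are my guesses at the names):

	// internal.containsValue(value) compares the raw value against a
	// SkeletonValue wrapper, which is never equal. It would have to unwrap:
	for (SkeletonValue<V> sv : internal.values()) {
		if (value == null ? sv.data == null : value.equals(sv.data)) {
			return true;
		}
	}
	return false;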
74603b79fbce24725751cdfbfe7bcd15df704213
TermPageEntry: will the title be used? It says "here for back compatibility"??
cd84583f555ea5b57daed98a97ef00d9dedb9728 (platy)
LibrarianHandler:
Setting the description to "not available" has localisation issues - can we use
""? I suppose we could use the localised version of "not available"...
Old bug, easy solution?
b18414028a2ce5eadbd9220eeca2b845aa13bb67 (platy)
ResultSet:
private HashMap<TermEntry, TermEntry> internal; - huh? Why not a set?
intersect: the internal hashset is unnecessary, just count them. And it will
break if any of them have an equal subject - do we check for this?
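Counting would look something like this (a sketch; it assumes, per the question
above, that no single input set contains two entries with an equal subject):

	// Keep an entry iff it occurs in all n input sets.
	Map<TermEntry, Integer> counts = new HashMap<TermEntry, Integer>();
	int n = 0;
	for (Collection<TermEntry> set : sets) {
		++n;
		for (TermEntry e : set) {
			Integer c = counts.get(e);
			counts.put(e, c == null ? 1 : c + 1);
		}
	}
	for (Map.Entry<TermEntry, Integer> en : counts.entrySet()) {
		if (en.getValue() == n) result.add(en.getKey());
	}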
99080a11562b4fee3eeed617bebd8dddaa0d080a
ResultSet again:
phrase: this could also be written with a counter, keeping the TermEntry from
the last iteration for the term positions, avoiding a list.
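Roughly like this (a sketch; getPositions() is an assumed accessor on
TermPageEntry):

	// For one page, walk the phrase's terms in order, carrying only the
	// surviving positions forward instead of accumulating a list of entries.
	Set<Integer> carry = null;
	for (TermPageEntry cur : termsInPhraseOrder) {
		if (carry == null) {
			carry = new HashSet<Integer>(cur.getPositions());
		} else {
			Set<Integer> next = new HashSet<Integer>();
			for (Integer p : carry) {
				if (cur.getPositions().contains(p + 1)) next.add(p + 1);
			}
			carry = next;
		}
		if (carry.isEmpty()) break; // phrase cannot match on this page
	}
	boolean phraseFound = carry != null && !carry.isEmpty();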
bcb80c457089f2114c0b31e85127c0b379786533
Breadth-first parallel inflate: this will likely create a lot of threads! I
think this may have been made pooled later?
Also, you need to use NativeThreads to set the priority class, which means you
probably need to use the node's executor (to increase priority)... we can deal
with this later if need be.
4000b1b88d14c1eda47e18fc45283c3d9a561a52
ThreadPoolExecutor may be a problem because of NativeThread; not an immediately
urgent issue though. We could use a ThreadFactory to solve this.
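A sketch of the ThreadFactory route (the NativeThread constructor used here is
an assumption, check it against fred):

	ThreadPoolExecutor pool = new ThreadPoolExecutor(
		2, 8, 60, TimeUnit.SECONDS,
		new LinkedBlockingQueue<Runnable>(),
		new ThreadFactory() {
			public Thread newThread(Runnable r) {
				// The point: the pool's threads are NativeThreads
				// created at a sane priority class.
				return new NativeThread(r, "Library-serialiser",
						NativeThread.NORM_PRIORITY, true);
			}
		});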
9a1855c7ba183a38ccd19f273a4fdb32edff7a83
Can we be certain that a freenet index can never cause us to search a file
index?
5394f13e60ca206370aeba722db261a897436fb2
FreenetArchiver.pushLive():
note that the bucket will in some cases be limited to expected_bytes, so you
should pass in a big number if you are not sure!
b880788bbe1f6966c0d887d369bfc840d602c39a
Library: singletons: seems to make sense. What about a high-level regression
test using multiple nodes and a Library plugin on each? Maybe far-fetched... I
have heard you can do this with classloader tricks, maybe that's better... Not
a problem if these objects are reusable and don't have any significant state...?
b5a597cfd57dbbb1eebf1eefebc5fac4cafc8cbf
ProtoIndexSerialiser:
Can we be sure here that we never leave the insert ID in the metadata where it
gets serialised? It should be external, no? Or at least, it should be turned
from an insert URI into a request URI?
83491ddd786d84e080a467e989384916afd56416
Library.getType* etc:
Getting the type from the filename means that if the user (or the bookmark)
passes the *full* URI, with the /index.xml or /index.yml, we only need to fetch
it once. We should try to make sure this happens, as it's far more efficient!
The first step would be to change the bookmark URLs...
caefcb5034a1a1aa4595391bc4203797ac47941e
Interesting - when you do a search from a freesite, it can optionally add the
bookmark, so you confirm it. How about assuming that you *might* want to
bookmark it, and just having an explanation and a button on every search that
uses a nonstandard index? Also, it should be a button (a POST request), not a
link.
8702a6745fee793fe77af537b7c347adadcd2fc0
FreenetArchiver:
Fetch can go over 100% on newly inserted data??? If so, this is a bug, but it
is prudent to detect and handle it. It certainly can on old data.
8702a6745fee793fe77af537b7c347adadcd2fc0
SimpleProgress:
+	@Override public ProgressParts getParts() throws TaskAbortException {
+		// updates are made such that the ProgressParts contract isn't broken even
+		// in mid-update so we don't need to synchronize here.
+		if (abort != null) { throw abort; }
+		return new ProgressParts(pdone, known, known, estimate);
	}
Are you sure? Write order is not guaranteed if we don't synchronize.
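The simplest safe version would be something like:

	// Without a happens-before edge, a reader can see a fresh "known" but a
	// stale "pdone", breaking the ProgressParts contract. Synchronizing the
	// getter and the updates on the same monitor fixes that.
	@Override public synchronized ProgressParts getParts() throws TaskAbortException {
		if (abort != null) { throw abort; }
		return new ProgressParts(pdone, known, known, estimate);
	}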
ed0e629e2ba514ce23d33a286d380851e0eef8f5 (platy)
/**
 * Construct a LibrarianHandler to look for many terms
 * @param requests the requests wanting to be resolved by this LibrarianHandler, results are written back to them
 * @param wantPositions should the positions be recorded? this has a performance impact but allows for phrase searching
 * @throws java.lang.Exception
 */
public LibrarianHandler(List<FindRequest> requests) {
	this.requests = new ArrayList(requests);
	for (FindRequest r : requests){
		r.setResult(new HashSet<TermPageEntry>());
-//		wantPositions = wantPositions || r.shouldGetPositions();
	}
-	wantPositions = true;
}
Are we always getting the positions now, just for the relevance? We don't have
to keep them in memory after we have calculated the relevance...
cafcea9db4c0007e9153eaa93bf4cebca450a07d
Library:
+
+	public Object getKeyTypeFromString(String indexuri) {
+		try {
+			return KeyExplorerUtils.sanitizeURI(new ArrayList<String>(), indexuri);
+		} catch (MalformedURLException e) {
+			File file = new File(indexuri);
+			if (file.canRead()) {
+				return file;
+			} else {
+				throw new UnsupportedOperationException("Could not recognise index type from string: " + indexuri);
+			}
+		}
	}
Confusing function name.
cd058a3293e037da9428091ded0b09aa712c5e1e
ProgressParts:
	/**
	** Returns the parts done as a fraction of the estimated total parts.
	** Note: this will return negative (ie. invalid) if there is no estimate.
	*/
	final public float getEstimatedFractionDone() {
-		return totalest == 0? 0.0f: (float)done / totalest;
+		return totalest == 0? 0.0f: totalest == TOTAL_FINALIZED? 1.0f: (float)done / totalest;
	}
Huh???? These don't make sense - finalized is surely not the same as done?
	public static ProgressParts getSubParts(Iterable<? extends Progress> subprogress, boolean try_to_be_smart) throws TaskAbortException {
		int d = 0, s = 0, k = 0, t = 0;
		int num = 0, unknown = 0;
+		boolean totalfinalized = true;
		for (Progress p: subprogress) {
			++num;
			if (p == null) { ++unknown; continue; }
			ProgressParts parts = p.getParts();
			d += parts.done;
			s += parts.started;
			k += parts.known;
-			if (parts.totalest < 0) { ++unknown; }
-			else { t += parts.totalest; }
+			switch (parts.totalest) {
+			case ESTIMATE_UNKNOWN:
+				++unknown;
+				totalfinalized = false;
+				break;
+			case TOTAL_FINALIZED:
+				t += parts.totalest;
This can't possibly be right: += -1 ??
+				break;
+			default:
+				totalfinalized = false;
+				t += parts.totalest;
+			}
		}
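Presumably the intent was something like the following (a guess: TOTAL_FINALIZED
is a sentinel, so the real total has to come from somewhere else, e.g.
parts.known):

+			case TOTAL_FINALIZED:
+				// parts.totalest is a negative sentinel here, not a
+				// count; add the actual total instead.
+				t += parts.known;
+				break;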
66a26efb893d9491f72de59e5f3730933ebfd12c
If it completes, check the MIME type on the FetchResult before throwing.
Also, you should consider accepting an index on the basis of its default
filename fetching successfully, even if it is inserted as
application/octet-stream? I dunno - are you expecting the new spider to insert
indexes itself? I guess it's more important that we detect the filename if it
is specified, and to encourage users to specify it.
Also, SnoopMetadata allows you to do the equivalent of KeyExplorer (directory
listing) without depending on a plugin; obviously this isn't a priority.
d1e0f212e51b2cbeb45ffcf087e44b9fbd569952
FindRequest:
		}else
-			throw new UnsupportedOperationException("Not supported yet. : "+ce.getClass().getName());
+			throw new UnsupportedOperationException("Not supported yet. This ClientEvent hasn't been identified as being needed yet : "+ce.getClass().getName()); // Haven't come across any other ClientEvents arriving here yet
Eeek, do you really have to throw? We are likely to introduce new events; you
should just ignore them, or log them.
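i.e. something like:

	} else {
		// New/unknown ClientEvents are harmless; record them and move on.
		Logger.minor(this, "Ignoring unhandled ClientEvent: " + ce.getClass().getName());
	}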
d1e0f212e51b2cbeb45ffcf087e44b9fbd569952
XMLIndex:
	/** Called on failed/canceled fetch */
	public void onFailure(FetchException e, ClientGetter state, ObjectContainer container){
+		if(e.newURI!=null){
+			try {
+				indexuri = e.newURI.setMetaString(null).toString();
+				startFetch();
+				return;
+			} catch (FetchException ex) {
+				Logger.error(this, "what?", ex);
+			} catch (MalformedURLException ex) {
+				Logger.error(this, "what?", ex);
+			}
+		}
		fetchStatus = FetchStatus.FAILED;
+		Logger.error(this, "Fetch failed on "+toString() + " -- state = "+state, e);
	}
Maybe there should be a counter? An infinite loop *should* be impossible, but
bugs might do bad things...
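e.g. a simple cap (sketch; the field and limit are made up):

	private int redirected = 0;
	private static final int MAX_REDIRECTS = 10; // arbitrary bound

	public void onFailure(FetchException e, ClientGetter state, ObjectContainer container) {
		if (e.newURI != null && ++redirected <= MAX_REDIRECTS) {
			// ...follow the new URI as above...
		}
		// otherwise fall through to FetchStatus.FAILED
	}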
2884d18dd7d2472677a363caf38989e2d4c69d52
MainPage:
Do we actually forget the index after the user clicks no? I still think it
might be easier to just have a button shown when it's not a standard index...
ec328e158b438bd15894c7380efa7b95b32cb714 (platy)
Is it necessary to use java/xml logging? The 200ms sleep is messy; should add a
TODO. Not entirely sure what the problem is...
Changelog (rather schizophrenic; you don't have to read it, I was reviewing the
code anyway - but it has gone in the tag, so you might want to look at it):
Library:
- Base on XMLLibrarian up to 3ed7453c0f70a61e21839bb5c0b4d11aab6c5c69 *AND*
Interdex at rev 0860091a26323360ff1a77c9d48077fc17b0d3e2. (minor build.xml
conflicts).
- Rebrand to Library. This will have UI and FCP for both old and new (interdex)
indexes. Move from src/plugins/XMLLibrarian/ to src/plugins/Library/ and update
build.xml (in some places using $(packagebase)). Move some packages around,
e.g. Interdex into index/Interdex, some stuff into util or search, xmlindex/
into index/xml.
- findTerm(), getIndex(), will presumably be exposed to FCP.
- Delete some unused code.
- Experiments with RPC-over-FCP code, all gone now.
- Add GPL headers.
- Javadocs: packages.
- Javadocs.
- Comments.
- Move plugin stuff to the Main class; Library for index stuff.
- Logging.
- Index interface.
- WritableIndex interface.
- Move some older Index stuff into XMLIndex, make it implement Index.
- Generics.
- Factor out AbstractRequest, base of FindRequest.
- Simplify build-dep instructions.
- Move Interdex stuff to index/, or util/ or serial/, delete duplicated or
irrelevant files, move many files around.
- Indenting.
- Deprecate PrefixTreeMap.
- Try to merge the progress (Request) interfaces from both plugins. This
affects a remarkable amount of code, involving much refactoring. Delete the
ERROR completion code; it will throw in getResult() instead.
- TODO file editing.
- bookmarks: bookmark:name(uri) bookmarks an index.
- Library.addBookmark(). Add one by searching ~name(url).
- catch RuntimeException's and display in the web UI.
- ProtoIndex.getTermEntries.getResult() returns an immutable collection.
- change serialised representation of btree root.
- Style changes, centralising style.
- better ui for searching multiple indexes, showing which you are using and
showing checkboxes for ones you could use.
- build.xml: fetch SnakeYAML and verify checksum.
- *Packer -> *SplitPacker. Deprecate *SplitPacker.
- New Packer: abstract class, a few methods must be filled in, but provides
configurable packing algorithm. Best fit decreasing with some options. Can fit
stuff into bins even if they already have data in them. Can repack. Pulls data
in to repack where necessary. Old Packer would split items, this greatly
complicates matters; new Packer does not.
- Toadlet interface for search. Searches can only be started from POSTs. POSTs
require the formPassword, thus neutralising any threat from malicious external
sites. Lots of bug chasing...
- BTreeSet. SkeletonBTreeSet.
- Interdex: error handling work.
- Interdex: Where a BTree has the data internally, there are no external
dependencies (as opposed to e.g. btree roots packed into a series of buckets),
which means that serialisation is tricky, because we expect to push the
dependencies as ghosts, at which point isBare() will be true and we can push
the node itself. If there is only internal data, copy it to the metadata so we
can be isBare(), and then copy it back when we serialise the node. In future
there will be a more efficient solution.
- Interdex: error checking.
- Interdex: put TokenEntry's within a keyword into a BTreeSet, not a TreeSet.
Pack the roots rather than the whole sets.
- Interdex: BinInfo, no longer an Object[2].
- Interdex: many corner cases.
- Interdex: abstract out SortedMapSet from BTreeSet, implements subSet etc,
cleans up code.
- Interdex: Fix BTreeMap iterator hasNext().
- Interdex: Extend unit tests.
- Interdex: Scale knows its Packer.
- Interdex: delete old code.
- Interdex: progress tracking support in Packer. Reinstate progress tracking
code, tweak job names for logging purposes.
- Interdex: fix some inflate bugs.
- Interdex: fix packer bugs related to completion.
- Interdex: support accessing by index, cache size of tree below each node
(_size or treeSize()), rename size() to nodeSize().
- Interdex: Get routing key from the NodeKey for URIKey's.
- Interdex: implement utab (uri -> uri entries). Implement translators and
packers for utab and ttab, BIndexSerialiser should now be functional.
- Interdex: make packer more generic/configurable; Inventory contains most
functions of Scale and proxies the actual weighing to Scale.
- Interdex: rename MAGIC to serialVersionUID.
- Interdex: proper hashCode, equals, etc for Token*Entry.
- Interdex: rename TokenEntry -> TermEntry, TokenURIEntry -> TermPageEntry.
- Interdex: set dumper options so dumps are readable for debugging purposes.
- Interdex: custom yaml format for BinInfo.
- Interdex: new SkeletonTreeMap, uses a wrapper for values with loaded status
and metadata, cleaner code, better iterators etc, in particular support for
iterating over unloaded data and it throws when you try to get the values,
partial support for subMap/headMap/tailMap, can load data while iterating (will
throw, caller can then load on the map, then go back to iterator), fixes some
bugs.
- Interdex: generics.
- Remove progress trackables in ProtoIndex, replace with a stack of trackables;
show progress while getting the root.
- SkeletonBTreeMap: deflate/inflate on right hand side not left hand side.
- Interdex: Lots of minor refactoring.
- Don't depend on FredPluginAPI. It is unlikely to work any time soon, we will
use FCP instead.
- Eliminate URIWrapper in favour of TermPageEntry, which now includes title.
Remove code related to uri being a string on URIWrapper, since it's a
FreenetURI on TermPageEntry.
- Sanity checking.
- Interdex: refactor makeProgressIterable etc.
- CompositeRequest interface for requests with subrequests.
- TermEntry.equalsTarget(), rel.combine().
- Rewrite/refactor ResultSet to handle TermEntry's better, move combination
code out of Search. No phrase search yet, was turned off anyway. Is immutable,
operations happen in constructor. Split DIFFERENTINDEXES mode out of UNION,
currently handled the same, add SINGLE mode. Intersection may be slightly more
efficient.
- Refactor web interface for TermEntry's and CompositeRequest.
- Delete some old code.
- Combine relevances in Term*Entry, not in TermEntry, and make TermEntry
immutable, create a new one in combine().
- More refactoring, optimising merging modes in ResultSet.
- Working phrase search.
- Re-enable positions support in XML index code. Only use it if we need to.
- XML index: Calculate a simple relevance score based on word count and file
count.
- Fix comparisons (left/right/middle, different null handling in each) in
btreemap, order Node's for breadth-first search.
- Breadth-first search for inflation. ****** likely to use a lot of threads...
- AbstractRequest: record time taken.
- Interdex: TaskAbortException thrown some places, setAbort -> abort (which
throws it after setting it).
- Interdex: Improve progress text.
- Use a ThreadPoolExecutor in ParallelSerialiser. TaskHandler handles tasks on
a thread (in serial mode), shifts them to the executor if they are already
running, keeps progress data up to date; in parallel mode, run directly on
executor (caller must create progress).
- Interdex: ProgressTracker now uses a WeakHashMap, of tasks (not data).
get*Progress just gets the progress, add*Progress throws a
TaskInProgressException if the task is already running otherwise adds it.
- Interdex: Parallel inflation on BTreeMap.
- Interdex: Various refactoring.
- Remove wanna19 from default indexes.
- Remove FredPluginHTTP interface.
- Refactor web interface, new class ResultNodeGenerator, delete WebUI.
ResultNodeGenerator splits up results, will properly handle TermTermEntry's and
TermIndexEntry's, and less code duplication than old code.
- Split YamlArchiver into FileArchiver and YamlReaderWriter, use
ObjectStreamReader/Writer (simpler than ObjectInput/Output) interfaces.
- Freenet archiver. Initially write-only.
- Web interface access to Tester, on-freenet tests.
- Tester creates a FreenetArchiver, puts a load of test data, and pushes the
data, showing progress; progress test tests a map of ints to ints, push test
tests a smaller but more complex map of string to bunch of fake CHKs.
- SimpleProgress changes: track parts done, parts known, estimated parts, last
status string, addTotal -> addPartKnown. if estimated parts == total, then
isTotalFinal() is true. enteredSerialiser/exitingSerialiser must be called.
- Imports.
- Option to group USKs. Sort by relevance either way.
- Show title (tooltip) of links as the url for the direct link and the [ USK ].
- Show relevance in tooltip for the main link.
- Put search toadlet on the Browse Freenet menu.
- Refetch on demand after a failure.
- Make sure the search is removed after a TaskAbortException.
- Factor out combine() from TermEntry into a method in ResultSet.
- Remove WebUI references, reinstate FredPluginHTTP. **** merge fix hack
- CSS fixes.
- Detect when trying to do a phrase search on an index with no position info.
- Only show the full error trace with a big box if the error is other than an
InvalidSearchException.
- Automatically determine index types by partially fetching the URI. Initially
implemented using KeyExplorer, detect via either mime type or existence of the
base filename for the index type. This is dubious, SEE BELOW.
- UI for adding an index.
- POST password check works a different way, can do a search from a POST but
not add an index.
- Use L10n (even though it doesn't work).
- Singleton-ise Library.
- Split BIndexSerialiser into ProtoIndexComponentSerialiser and
ProtoIndexSerialiser.
- Static map of one ProtoIndexComponentSerialiser per file format. Make many
members non-static. ProtoIndexSerialiser handles the overall ProtoIndex,
ProtoIndexComponentSerialiser helps to serialise the individual maps.
- Add a testing yaml index to the test for the auto-detect code.
- Pass the insert or fetch URI in to a push task / pull task via the metadata,
just as we pass in the filenames.
- Remove most DataFormatException constructors to avoid misuse.
- Static singleton ProtoIndexSerialiser in Library.
- Pass a FreenetURI into getIndexType, not a String.
- Wire in support for yaml indexes in Library.
- Yaml indexes: support for pulling from Freenet.
- Catch exceptions from getIndexType in parent method getIndex.
- Server-side state based authorisation/confirmation scheme for adding
bookmarks (this makes sense because we can add a bookmark by changing the name
of the index).
- Progress: ProgressParts object contains done, total, known, can presumably be
got atomically so avoids races. Remove RequestState from AbstractRequest. Lots
of related work. Request extends ChainedProgress, remove unneeded methods.
Eliminate AtomicProgress. New class BaseCompositeProgress, base composite
progress tracker for parallel tasks. ChainedProgress: series of tasks.
- BaseCompositeProgress.join() can see nulls because it is fed an iterable over
the tasks, which then looks up the progresses, which can be null before the
progress has started. After it has finished it will however still be present,
as the iterator has a strong reference to the progress, and the WeakHashMap is
weak on the keys - the tasks. Updates to SimpleProgress, lots of code.
- Various minor refactoring related to Progress, use the Progress interfaces
where possible, track what is needed for them in e.g. FindRequest.
- More work on progress, refactoring.
- System.currentTimeMillis, not new Date().getTime().
- Always get the termpositions if only for relevance.
- Avoid NPE displaying search page, show failures better.
- Better auto-detect of URI vs filename for indexes.
- Index type test: test the SSK URIs, not the USKs.
- Refactor: ProtoIndexSerialiser.forIndex() not new ProtoIndexSerialiser(),
have a static map of serialisers by class, change FreenetArchiver/FileArchiver
parameters, minor constants refactoring, minor refactoring in FileArchiver.
- Bugfix in ParallelSerialiser.isActive(), no idea of the consequences,
probably none. Bugfix in SkeletonNode.rev().
- Handle over 100% in splitfile progress better in Interdex index fetching.
- Another fix to FreenetArchiver progress from splitfile code (derive
percentage from minSuccessful not total).
- getTermEntriesHandler: combine metas and trackers into a LinkedHashMap, track
current tracker and meta.
- Refactor Progress estimate type. UNKNOWN or FINALIZED (= known, but only this
enables finalizedTotal() and thus completion), or a specific value. Finalized
total must be set before the request completes. This is often done
automatically by SimpleProgress.addPartKnown(,true). Set total finalised in
other places.
- Rename rootSize -> sizeRoot, add nodeMin, entMax, heightEstimate to BTree*.
- Progress with number of btree levels lookup estimate on getTermEntriesHandler.
- Less efficient index type detection not using KeyExplorer (it was having to
download a non-staging binary, IMHO this was bad). Based on MIME type, fetches
[key]/, [key]/index.xml, [key]/index.yml.
- Refactor FindRequest (XML index progress), use an enum not a string for
stages, keep an internal Progress impl up to date for the current stage, use
the index of the enum for the current stage to set the stage for ProgressParts
purposes. Use the enums when setting progress externally.
- Handle failure with a new URI - USK@ updates, not enough path components etc
- for XML indexes.
- Logging.
- Improve progress bars a little, fix them for new Progress.
- More CSS changes for progress bar.
- Minor fixes in web interface, better error catching/handling/displaying.
- Refactor web interface, now there is only MainPage.
- Beginnings of Interdex distributed search implementation. Interdex,
InterdexQuery, IndexQuery. Document algorithms a bit more in comments.
- Make fields private or final as needed.
- AbstractRequest: implement join().
- Web interface bugfix adding bookmarks.
- Style fixes, use style in POST handler.
- Search progress: don't return as composites, stick it all together, unless it
is different indexes being fetched. Probably looks better on search progress.
Implement status functions.
- Fix some NPEs, simpler progress bar.
- Synchronization in parsing.
- Temporary hack, put toadlet on /plugins/plugin.Library.FreesiteSearch - we
need to let /library/ through the filter.
platy
infinity0
sdiz