Re: [Artifactory-users] (o.a.j.c.q.l.MultiIndex:1218) - 210000 nodes indexed... - taking waaaaay too much space

Noam Y. Tenne Sun, 01 Aug 2010 08:17:43 -0700

We've opened up http://issues.jfrog.org/jira/browse/RTFACT-3395 for this issue.


On 08/01/2010 05:49 PM, Jay Colson wrote:

Noam,


Please see repo.xml below.  Oracle is 10.2.0.4.  Size of repo is just under 
1TB.  The total nodes indexed were just under 8 million.

j
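
For anyone following a similar re-index, here is a minimal sketch of how progress could be tracked from the log. Only the fragment "(o.a.j.c.q.l.MultiIndex:1218) - 210000 nodes indexed..." comes from this thread's subject line; the rest of the log line layout, the function names, and the idea of comparing against an expected total (roughly 8 million here) are assumptions for illustration.

```python
import re

# Matches the fragment seen in the thread subject, e.g.
# "(o.a.j.c.q.l.MultiIndex:1218) - 210000 nodes indexed..."
# Anything before the parenthesis (timestamp, level) is simply ignored.
NODES_INDEXED = re.compile(
    r"\(o\.a\.j\.c\.q\.l\.MultiIndex:\d+\)\s*-\s*(\d+) nodes indexed"
)


def latest_indexed_count(log_lines):
    """Return the node count from the last matching line, or None if none match."""
    latest = None
    for line in log_lines:
        match = NODES_INDEXED.search(line)
        if match:
            latest = int(match.group(1))
    return latest


def progress_fraction(indexed, expected_total):
    """Fraction of the expected total indexed so far, capped at 1.0."""
    return min(indexed / expected_total, 1.0)
```

Feeding it the lines of the Artifactory log and an estimate of the total node count gives a rough idea of how far the re-index has got; the estimate itself has to come from elsewhere.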


<?xml version="1.0" encoding="UTF-8"?>
<!--
  ~ Artifactory is a binaries repository manager.
  ~ Copyright (C) 2010 JFrog Ltd.
  ~
  ~ Artifactory is free software: you can redistribute it and/or modify
  ~ it under the terms of the GNU Lesser General Public License as published by
  ~ the Free Software Foundation, either version 3 of the License, or
  ~ (at your option) any later version.
  ~
  ~ Artifactory is distributed in the hope that it will be useful,
  ~ but WITHOUT ANY WARRANTY; without even the implied warranty of
  ~ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  ~ GNU Lesser General Public License for more details.
  ~
  ~ You should have received a copy of the GNU Lesser General Public License
  ~ along with Artifactory.  If not, see <http://www.gnu.org/licenses/>.
  -->
<!DOCTYPE Repository PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 2.0//EN"
        "http://jackrabbit.apache.org/dtd/repository-2.0.dtd">
<Repository>

    <!-- Oracle Datasource -->
    <DataSources>
        <DataSource name="ds">
            <!-- Leave this on "oracle" -->
            <param name="databaseType" value="oracle" />
            <param name="driver" value="oracle.jdbc.driver.OracleDriver" />
            <param name="url" value="XXXX" />
            <param name="user" value="XXXX" />
            <param name="password" value="XXXX" />
            <!--<param name="validationQuery" value=""/>-->
            <!-- Unlimited when not specified -->
            <!--<param name="maxPoolSize" value="80"/>-->
        </DataSource>
    </DataSources>

    <!--
        virtual file system where the repository stores global state
        (e.g. registered namespaces, custom node types, etc.)
    -->

    <!-- Oracle File System -->
    <FileSystem class="org.apache.jackrabbit.core.fs.db.OracleFileSystem">
        <param name="dataSourceName" value="ds" />
        <param name="schemaObjectPrefix" value="rep_" />
        <!--<param name="schema" value=""/>-->
    </FileSystem>

    <!-- http://wiki.apache.org/jackrabbit/DataStore -->

    <!-- Oracle Datastore -->
    <DataStore class="org.artifactory.jcr.jackrabbit.ArtifactoryDbDataStoreImpl">
        <param name="dataSourceName" value="ds" />
        <param name="schemaObjectPrefix" value="ds_" />
        <!--<param name="tablePrefix" value=""/>-->

        <param name="minRecordLength" value="512" />
        <!-- Whether to cache blobs temporarily on the file system for faster reads -->
        <param name="cacheBlobs" value="true" />
        <!-- The maximum size of the blobs cache in gigabytes (g), megabytes (m) or kilobytes (k) -->
        <param name="blobsCacheMaxSize" value="1g" />
        <!-- The blobs cache directory location -->
        <param name="blobsCacheDir" value="${rep.home}/cache" />
    </DataStore>

    <!--
        security configuration
    -->
    <Security appName="Jackrabbit">
        <SecurityManager class="org.artifactory.jcr.NullJackrabbitSecurityManager" />
    </Security>

    <!--
        location of workspaces root directory and name of default workspace
    -->
    <Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default" />
    <!--
        workspace configuration template:
        used to create the initial workspace if there's no workspace yet
    -->
    <Workspace name="${wsp.name}">
        <!--
            virtual file system of the workspace:
            class: FQN of class implementing the FileSystem interface
        -->
        <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
            <param name="path" value="${wsp.home}" />
        </FileSystem>
        <!--
            persistence manager of the workspace:
            class: FQN of class implementing the PersistenceManager interface
        -->

        <!-- Oracle Persistence Manager -->
        <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.OraclePersistenceManager">
            <param name="dataSourceName" value="ds" />
            <param name="schemaObjectPrefix" value="${wsp.name}_" />
            <param name="bundleCacheSize" value="16" />
            <param name="errorHandling" value="IGN_MISSING_BLOBS" />
        </PersistenceManager>

        <!--
            Search index and the file system it uses.
            class: FQN of class implementing the QueryHandler interface

            If required by the QueryHandler implementation, one may configure
            a FileSystem that the handler may use.

            Supported parameters for lucene search index:
            - path: location of the index. This parameter is mandatory!
            - useCompoundFile: advises lucene to use compound files for the index files
            - minMergeDocs: minimum number of nodes in an index until segments are merged
            - volatileIdleTime: idle time in seconds until the volatile index is
              moved to persistent index even though minMergeDocs is not reached.
            - maxMergeDocs: maximum number of nodes in segments that will be merged
            - mergeFactor: determines how often segment indices are merged
            - maxFieldLength: the number of words that are fulltext indexed at most per property.
            - bufferSize: maximum number of documents that are held in a pending
              queue until added to the index
            - cacheSize: size of the document number cache. This cache maps
              uuids to lucene document numbers
            - forceConsistencyCheck: runs a consistency check on every startup. If
              false, a consistency check is only performed when the search index
              detects a prior forced shutdown. This parameter only has an effect
              if 'enableConsistencyCheck' is set to 'true'.
            - enableConsistencyCheck: if set to 'true' a consistency check is
              performed depending on the parameter 'forceConsistencyCheck'. If
              set to 'false' no consistency check is performed on startup, even
              if a redo log had been applied.
            - autoRepair: errors detected by a consistency check are automatically
              repaired. If false, errors are only written to the log.
            - analyzer: class name of a lucene analyzer to use for fulltext indexing of text.
            - queryClass: class name that implements the javax.jcr.query.Query interface.
              this class must extend the class: org.apache.jackrabbit.core.query.AbstractQueryImpl
            - respectDocumentOrder: If true and the query does not contain an 'order by' clause,
              result nodes will be in document order. For better performance when queries return
              a lot of nodes set to 'false'.
            - resultFetchSize: The number of results the query handler should
              initially fetch when a query is executed.
              Default value: Integer.MAX_VALUE (-> all)
            - extractorPoolSize: defines the maximum number of background threads that are
              used to extract text from binary properties. If set to zero (default) no
              background threads are allocated and text extractors run in the current thread.
            - extractorTimeout: a text extractor is executed using a background thread if it
              doesn't finish within this timeout defined in milliseconds. This parameter has
              no effect if extractorPoolSize is zero.
            - extractorBackLogSize: the size of the extractor pool back log. If all threads in
              the pool are busy, incoming work is put into a wait queue. If the wait queue
              reaches the back log size incoming extractor work will not be queued anymore
              but will be executed with the current thread.
            - synonymProviderClass: the name of a class that implements
              org.apache.jackrabbit.core.query.lucene.SynonymProvider. The
              default value is null (-> not set).

            Note: all parameters (except path) in this SearchIndex config are default
            values and can be omitted.
        -->
        <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
            <param name="path" value="${rep.home}/index" />
            <param name="useCompoundFile" value="true" />
            <!-- Default is 100 -->
            <param name="minMergeDocs" value="500" />
            <param name="maxMergeDocs" value="10000" />
            <param name="volatileIdleTime" value="3" />
            <!-- Default is 10: more segments mean quicker indexing but slower searching -->
            <param name="mergeFactor" value="10" />
            <param name="maxFieldLength" value="10000" />
            <!-- Default is 10 -->
            <param name="bufferSize" value="100" />
            <param name="cacheSize" value="1000" />
            <param name="forceConsistencyCheck" value="false" />
            <param name="enableConsistencyCheck" value="true" />
            <param name="autoRepair" value="true" />
            <param name="analyzer" value="org.artifactory.search.lucene.ArtifactoryAnalyzer" />
            <param name="queryClass" value="org.apache.jackrabbit.core.query.QueryImpl" />
            <param name="respectDocumentOrder" value="false" />
            <param name="resultFetchSize" value="700" />
            <param name="supportHighlighting" value="true" />
            <param name="excerptProviderClass" value="org.artifactory.search.ArchiveEntriesXmlExcerpt" />

            <!--
            Use 5 background threads for text extraction that takes more than 100 milliseconds
            -->
            <param name="extractorPoolSize" value="5" />
            <param name="extractorTimeout" value="100" />
            <!-- Default is 100 -->
            <param name="extractorBackLogSize" value="500" />
            <!-- Indexing configuration -->
            <param name="indexingConfiguration" value="${rep.home}/index/index_config.xml" />
            <!-- Workspace inconsistency handler -->
            <param name="onWorkspaceInconsistency" value="lenient" />
        </SearchIndex>

        <!-- http://issues.apache.org/jira/browse/JCR-314 -->
        <ISMLocking class="org.apache.jackrabbit.core.state.FineGrainedISMLocking" />
    </Workspace>

    <!--
        Configures the versioning
    -->
    <Versioning rootPath="${rep.home}/version">
        <!--
            Configures the filesystem to use for versioning for the respective
            persistence manager
        -->
        <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
            <param name="path" value="${rep.home}/version" />
        </FileSystem>

        <!--
            Configures the persistence manager to be used for persisting version state.
            Please note that the current versioning implementation is based on
            a 'normal' persistence manager, but this could change in future
            implementations.
        -->
        <!--We do not use versioning-->
        <PersistenceManager class="org.apache.jackrabbit.core.persistence.mem.InMemPersistenceManager">
            <param name="persistent" value="false" />
        </PersistenceManager>
    </Versioning>

    <!-- Clustering configuration -->
    <!--
    <Cluster id="node1">
        <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
            <param name="revision" value="${rep.home}/revision.log"/>
            <param name="driver" value="com.mysql.jdbc.Driver"/>
            <param name="url"
                   value="jdbc:mysql://localhost:3306/artifactory?useUnicode=true&amp;characterEncoding=UTF-8"/>
            <param name="user" value="artifactory_user"/>
            <param name="password" value="password"/>
        </Journal>
    </Cluster>
    -->

</Repository>




On Aug 1, 2010, at 10:36 AM, Noam Y. Tenne wrote:

OK, we will look into this.
Could you please send us the JCR repo.xml file which is being used? (you
may send it to [email protected], if you wish to send it privately).
Also, which version of Oracle are you using, and what is the approximate
size of the repository that was converted?

Thanks for reporting this

On 08/01/2010 05:22 PM, Jay Colson wrote:

It MOST DEFINITELY grows _way_ larger than 1GB (180 GB+), which is a real 
issue during an upgrade, when you can't just stop the server and clear the 
cache, because it just rebuilds it all over again.

j

On Aug 1, 2010, at 10:21 AM, Noam Y. Tenne wrote:

If by "cache" folder you mean the one within $ARTIFACTORY_HOME/data/,
then yes; you may remove it (but make sure to stop Artifactory before
doing so).
According to the default Oracle JCR configuration xml that's bundled
with Artifactory (can be found within
$ARTIFACTORY_HOME/etc/repo/oracle10/), the cache folder size limit
should be 1GB (see Repository->Datastore->blobsCacheMaxSize).
If the cache folder grows uncontrollably beyond that size, this may be a
bug.
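
A minimal sketch of how one might check whether the cache folder has outgrown the configured limit. It reads blobsCacheMaxSize from the DataStore element of a repo.xml and totals the files under a directory. The function names are made up for illustration, and treating "g", "m" and "k" as binary multiples (1024-based) is an assumption; the thread does not say how Artifactory interprets them.

```python
import os
import re
import xml.etree.ElementTree as ET

# Assumption: k/m/g are 1024-based; the thread does not confirm this.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a value such as '1g', '512m' or '64k' into bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmg])\s*", text.lower())
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def configured_cache_limit(repo_xml_text):
    """Return blobsCacheMaxSize in bytes from repo.xml text, or None if absent."""
    root = ET.fromstring(repo_xml_text)
    for param in root.iterfind("./DataStore/param"):
        if param.get("name") == "blobsCacheMaxSize":
            return parse_size(param.get("value"))
    return None


def directory_size(path):
    """Total size in bytes of the regular files under path (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

Comparing directory_size() of the cache folder under $ARTIFACTORY_HOME/data/ with configured_cache_limit() of the repo.xml in use would show how far past the limit the folder has grown.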

On 08/01/2010 04:15 PM, Jay Colson wrote:

This ran for 5-6 hours. The cache was actually what was growing out of
control. I had to stop the process once to add another 200 GB of disk
space, which made the upgrade process take about 13 hours altogether.
Is it safe to delink cache directories from the filesystem and let
Artifactory recreate them?  How can the cache be better managed?

On Aug 1, 2010, at 8:49 AM, "Noam Y. Tenne"<[email protected]>    wrote:

Hi Jay,

Version 2.2 includes changes and improvements to the search capabilities
of Artifactory, hence the need for re-indexing and the growth of index
size.
The time it takes to re-index can vary from minutes up to an hour,
depending on the repository size and the strength of the host machine.
As of now, is it still indexing? If so, how large is your repository and
what are the specs of the server it runs on?

Noam

On 07/31/2010 08:21 AM, Jay Colson wrote:

Just upgraded to 2.5.5 from 2.0.8 and this thing has been indexing for hours.  
It's eating all my disk space (when my db is in Oracle) -- is there a way to 
just disable indexing for now?

Thanks,
j


------------------------------------------------------------------------------
The Palm PDK Hot Apps Program offers developers who use the
Plug-In Development Kit to bring their C/C++ apps to Palm for a share
of $1 Million in cash or HP Products. Visit us here for more details:
http://p.sf.net/sfu/dev2dev-palm
_______________________________________________
Artifactory-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/artifactory-users

