imply-cheddar commented on code in PR #13394:
URL: https://github.com/apache/druid/pull/13394#discussion_r1040461494
##########
docs/dependencies/deep-storage.md:
##########
@@ -25,24 +25,47 @@ title: "Deep storage"
Deep storage is where segments are stored. It is a storage mechanism that
Apache Druid does not provide. This deep storage infrastructure defines the
level of durability of your data, as long as Druid processes can see this
storage infrastructure and get at the segments stored on it, you will not lose
data no matter how many Druid nodes you lose. If segments disappear from this
storage layer, then you will lose whatever data those segments represented.
-## Local Mount
+## Local
-A local mount can be used for storage of segments as well. This allows you to
use just your local file system or anything else that can be mount locally like
NFS, Ceph, etc. This is the default deep storage implementation.
+Local storage is intended for use in the following situations:
-In order to use a local mount for deep storage, you need to set the following
configuration in your common configs.
+- You have just one server.
+- Or, you have multiple servers, and they all have access to a shared
filesystem (for example: NFS).
+
+In order to use a local directory for deep storage, set the following
configuration in `common.runtime.properties`.
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
-|`druid.storage.type`|local||Must be set.|
-|`druid.storage.storageDirectory`||Directory for storing segments.|Must be
set.|
+|`druid.storage.type`|`local`||Must be set.|
+|`druid.storage.storageDirectory`|any local directory|Directory for storing
segments. Must be different from `druid.segmentCache.locations` and
`druid.segmentCache.infoDir`.|`/tmp/druid/localStorage`|
+|`druid.storage.zip`|`true`, `false`|Whether segments in
`druid.storage.storageDirectory` are written as directories (`false`) or zip
files (`true`).|`false`|
Review Comment:
Personally, I'd be down for just making the segment storage stop zipping up
the files. It'd be good for a wide range of things, I think. The only risk is
that it might create extra data on deep storage for people, but I'm not sure
that's really the top-of-mind concern for most people.
##########
server/src/main/java/org/apache/druid/segment/loading/LocalDataSegmentPusher.java:
##########
@@ -106,27 +130,61 @@ public DataSegment pushToPath(File dataSegmentFile,
DataSegment segment, String
throw new IOE("Failed to rename [%s] to [%s]", tmpIndexFile,
indexFileTarget);
}
- return dataSegment;
+ return baseSegment.withLoadSpec(makeLoadSpec(new File(outDir,
INDEX_ZIP_FILENAME).toURI()))
+ .withSize(size);
}
finally {
- FileUtils.deleteDirectory(tmpOutDir);
+ FileUtils.deleteDirectory(tmpSegmentDir);
}
}
- @Override
- public Map<String, Object> makeLoadSpec(URI finalIndexZipFilePath)
+ private DataSegment pushNoZip(final File inDir, final File outDir, final
DataSegment baseSegment) throws IOException
Review Comment:
I'm just curious about the pushing. If the push creates a hard link to an
ephemeral location that belonged to the task and then the task directory is
deleted, what happens to that hard link? Does the file continue to exist and
only be linked from one location?
##########
docs/dependencies/deep-storage.md:
##########
@@ -25,24 +25,47 @@ title: "Deep storage"
Deep storage is where segments are stored. It is a storage mechanism that
Apache Druid does not provide. This deep storage infrastructure defines the
level of durability of your data, as long as Druid processes can see this
storage infrastructure and get at the segments stored on it, you will not lose
data no matter how many Druid nodes you lose. If segments disappear from this
storage layer, then you will lose whatever data those segments represented.
-## Local Mount
+## Local
-A local mount can be used for storage of segments as well. This allows you to
use just your local file system or anything else that can be mount locally like
NFS, Ceph, etc. This is the default deep storage implementation.
+Local storage is intended for use in the following situations:
-In order to use a local mount for deep storage, you need to set the following
configuration in your common configs.
+- You have just one server.
+- Or, you have multiple servers, and they all have access to a shared
filesystem (for example: NFS).
+
+In order to use a local directory for deep storage, set the following
configuration in `common.runtime.properties`.
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
-|`druid.storage.type`|local||Must be set.|
-|`druid.storage.storageDirectory`||Directory for storing segments.|Must be
set.|
+|`druid.storage.type`|`local`||Must be set.|
+|`druid.storage.storageDirectory`|any local directory|Directory for storing
segments. Must be different from `druid.segmentCache.locations` and
`druid.segmentCache.infoDir`.|`/tmp/druid/localStorage`|
+|`druid.storage.zip`|`true`, `false`|Whether segments in
`druid.storage.storageDirectory` are written as directories (`false`) or zip
files (`true`).|`false`|
+
+For example:
+
+```
+druid.storage.type=local
+druid.storage.storageDirectory=/tmp/druid/localStorage
+```
+
+When using local storage, it is recommended to use the same filesystem for
`druid.storage.type` and
+`druid.segmentCache.locations` on your Historicals. When these are on the same
filesystem, and
+`druid.storage.zip = false` (the default), Druid creates hard links between
the two directories rather than storing
+two copies of the same data.
+
+Local storage cannot be used if you have multiple servers without a shared
filesystem. In this case, consider using
Review Comment:
Consider "Local storage requires a shared file system when used across
multiple servers."
##########
docs/dependencies/deep-storage.md:
##########
@@ -25,24 +25,47 @@ title: "Deep storage"
Deep storage is where segments are stored. It is a storage mechanism that
Apache Druid does not provide. This deep storage infrastructure defines the
level of durability of your data, as long as Druid processes can see this
storage infrastructure and get at the segments stored on it, you will not lose
data no matter how many Druid nodes you lose. If segments disappear from this
storage layer, then you will lose whatever data those segments represented.
-## Local Mount
+## Local
-A local mount can be used for storage of segments as well. This allows you to
use just your local file system or anything else that can be mount locally like
NFS, Ceph, etc. This is the default deep storage implementation.
+Local storage is intended for use in the following situations:
-In order to use a local mount for deep storage, you need to set the following
configuration in your common configs.
+- You have just one server.
+- Or, you have multiple servers, and they all have access to a shared
filesystem (for example: NFS).
+
+In order to use a local directory for deep storage, set the following
configuration in `common.runtime.properties`.
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
-|`druid.storage.type`|local||Must be set.|
-|`druid.storage.storageDirectory`||Directory for storing segments.|Must be
set.|
+|`druid.storage.type`|`local`||Must be set.|
+|`druid.storage.storageDirectory`|any local directory|Directory for storing
segments. Must be different from `druid.segmentCache.locations` and
`druid.segmentCache.infoDir`.|`/tmp/druid/localStorage`|
+|`druid.storage.zip`|`true`, `false`|Whether segments in
`druid.storage.storageDirectory` are written as directories (`false`) or zip
files (`true`).|`false`|
+
+For example:
+
+```
+druid.storage.type=local
+druid.storage.storageDirectory=/tmp/druid/localStorage
+```
+
+When using local storage, it is recommended to use the same filesystem for
`druid.storage.type` and
+`druid.segmentCache.locations` on your Historicals. When these are on the same
filesystem, and
+`druid.storage.zip = false` (the default), Druid creates hard links between
the two directories rather than storing
+two copies of the same data.
Review Comment:
Personally, I wish that this would call out that manual setting of local
storage in a production environment is an "advanced feature" that should be
done intentionally. Or maybe just a first sentence that pushes for deep
storage more. Something like
```
Use of a cloud store is recommended for deep storage in any production
system. If you know what you are doing and are using local storage, consider
setting the same filesystem ...
```
I know the paragraph below calls out the cloud stuff as well, I don't think
it hurts to mention it multiple times.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]